Sharing my latest project called Codepilot. Codepilot is your new programming buddy and is basically GitHub Copilot on steroids. It’s an expert on your project’s codebase. It leverages Vectra, my local vector DB, to maintain an index of your project’s code that gets checked in right alongside the rest of your code. You’ll need an OpenAI key but that’s it. It walks you through the setup process, and you currently interact with it from a terminal window (VSCode extension coming). It uses gpt-3.5-turbo-16k by default, but you can configure it to use any chat-completion-based model. The ability to create files, modify existing files, create branches, PRs, and merge requests is all coming soon.
Oh lord, I’ve been watching it, but then all of a sudden the whole thing is full of code now xD. You already have a coder bot making stuff for you huh?
But for real, your projects are pretty cool, so I will take a look in a bit when work cools down.
Sounds cool. Can you explain how it is better/different than co-pilot?
I’ve been using it to build itself which is the ultimate test right?
GH Copilot only sees a handful of your files, so it doesn’t truly know your codebase. Codepilot maintains a local, per-branch vector DB that indexes your entire codebase: every branch gets its own index of that branch’s code. It also uses a feature I call Document Sections to do a better job of showing code to the model. I have a bunch of other features coming that will make it far surpass the capabilities of GH Copilot, but it’s already pretty good.
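The per-branch indexing idea can be sketched roughly like this. Everything here, including the `indexPathForBranch` name, is illustrative rather than Codepilot’s actual API: the point is just that the index folder is derived from the current git branch, so each branch gets its own index.

```typescript
import * as path from "path";
import { execSync } from "child_process";

// Illustrative only: look up the current git branch for a repo.
function currentBranch(repoRoot: string): string {
  return execSync("git rev-parse --abbrev-ref HEAD", { cwd: repoRoot })
    .toString()
    .trim();
}

// Branch names can contain "/" (e.g. "feature/foo"), so sanitize them
// before using them as a folder name under the repo's index root.
function indexPathForBranch(repoRoot: string, branch: string): string {
  const safe = branch.replace(/[\/\\]/g, "-");
  return path.join(repoRoot, ".codepilot", "index", safe);
}
```

With a scheme like this, switching branches simply points Codepilot at a different index folder, which can then be checked in alongside the code it describes.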
I asked Codepilot to add a new chat function, using `Codepilot.addFunction`, that lets users write code to files when chatting with Codepilot. The function name should be "createFile".
Here’s what it came back with:
Sure! Here’s an example of how you can add a new chat function using `Codepilot.addFunction`:

```typescript
import { ChatCompletionFunction } from "alphawave";
import { Codepilot } from "codepilot";

// Define the schema for the chat function
const writeCodeFunction: ChatCompletionFunction = {
    name: "writeCode",
    description: "Write code to a file",
    inputs: [
        {
            name: "filePath",
            type: "string",
            description: "The path to the file where the code will be written"
        },
        {
            name: "code",
            type: "string",
            description: "The code to write to the file"
        }
    ],
    outputs: [
        {
            name: "success",
            type: "boolean",
            description: "Indicates whether the code was successfully written to the file"
        }
    ]
};

// Define the implementation of the chat function
const writeCodeImplementation = async (args: any): Promise<any> => {
    const { filePath, code } = args;
    try {
        // Write the code to the file
        await fs.writeFile(filePath, code);
        return { success: true };
    } catch (error) {
        return { success: false };
    }
};

// Add the chat function to Codepilot
Codepilot.addFunction(writeCodeFunction, writeCodeImplementation);
```
With this code, you can now use the `writeCode` function in your chat conversations. For example:

```
User: Can you write the following code to "src/index.js"?
Codepilot: Sure, I can help with that. Please provide the code you want to write.
User: function greet() { console.log("Hello, world!"); }
Codepilot: Alright, I will write the code to "src/index.js". Give me a moment.
[Codepilot writes the code to the file]
Codepilot: I have successfully written the code to "src/index.js".
```

I hope this helps! Let me know if you have any further questions.
It’s actually really close… It got the schema wrong but I would attribute that to it not seeing the schema and there are 0 examples of functions in the codebase at this point. It got the signature for the function correct and honestly once it sees a few examples I would expect it to be spot on.
As I suspected… I tweaked the model’s implementation of `createFile`, re-indexed, and then asked it to create a new chat function that modifies existing files. It’s spot on implementation-wise now that it’s seen an example:
Sure! Here’s an example of how you can create a new chat function that modifies existing files:
```typescript
import { ChatCompletionFunction } from "alphawave";
import { Codepilot } from "../Codepilot";
import * as fs from "fs/promises";

/**
 * Schema for a function that modifies an existing file.
 */
const modifyFileFunction: ChatCompletionFunction = {
    name: "modifyFile",
    description: "Modifies an existing file at the specified path",
    parameters: {
        type: "object",
        properties: {
            filePath: {
                type: "string",
                description: "The path to the file to modify"
            },
            code: {
                type: "string",
                description: "The code to insert into the file"
            },
            position: {
                type: "number",
                description: "The position in the file where the code should be inserted"
            }
        },
        required: ["filePath", "code", "position"]
    }
};

/**
 * Adds the modifyFile function to the codepilot instance.
 */
export function addModifyFile(codepilot: Codepilot): void {
    codepilot.addFunction(modifyFileFunction, async (args: any) => {
        const { filePath, code, position } = args;
        try {
            // Read the existing file contents
            const fileContents = await fs.readFile(filePath, "utf-8");

            // Insert the code at the specified position
            const modifiedContents = fileContents.slice(0, position) + code + fileContents.slice(position);

            // Write the modified contents back to the file
            await fs.writeFile(filePath, modifiedContents);

            return `Successfully modified file at ${filePath}`;
        } catch (error) {
            return `Failed to modify file at ${filePath}`;
        }
    });
}
```
You can add this `modifyFile` function to your existing codepilot instance by calling the `addModifyFile` function and passing in the codepilot instance as an argument.
Modifying existing files isn’t that simple, but it’s on the right track by including the position offset of where to make the change. I honestly didn’t expect it to get that close.
This reads like a really cool project!
I will definitely spend some time with this.
Thanks for sharing!
I added one of my games to Codepilot but encountered an error:
```
vectra create

creates a new code index

Options:
      --version    Show version number                                 [boolean]
      --help       Show help                                           [boolean]
  -k, --key        OpenAI API key to use for generating embeddings and querying
                   the model.                                [string] [required]
  -m, --model      OpenAI model to use for queries. Defaults to
                   "gpt-3.5-turbo-16k".  [string] [default: "gpt-3.5-turbo-16k"]
  -s, --source     source folder(s) to index.                 [array] [required]
  -e, --extension  extension(s) to filter to.                            [array]

Error: Error generating embeddings: The embeddings API returned an error status of 500: Internal Server Error
    at LocalDocumentIndex.<anonymous> (C:\Users\VeitB\AppData\Roaming\npm\node_modules\@stevenic\codepilot\node_modules\vectra\lib\LocalDocumentIndex.js:217:27)
    at Generator.next (<anonymous>)
    at fulfilled (C:\Users\VB\AppData\Roaming\npm\node_modules\@stevenic\codepilot\node_modules\vectra\lib\LocalDocumentIndex.js:28:58)
    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)
```
Does this mean I should repeat the process?
Should I remove all “unnecessary” files from the directory before running codepilot?
By the way: the images were embedded alongside the code, and this took quite a long time.
So in the short run you can use `codepilot add -e (extension)` to filter the index to just your code files.
What’s the extension of your images? I’ll add some logic to skip images. I currently just assume you want to index everything.
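For illustration, the kind of extension filter being discussed might look like the sketch below. The extension list and the `shouldIndex` helper are assumptions for the example, not Codepilot’s actual code:

```typescript
// Common binary/media extensions that would waste embedding tokens if indexed.
const BINARY_EXTENSIONS = new Set([
  ".png", ".jpg", ".jpeg", ".gif", ".bmp", ".ico", ".webp",
  ".mp3", ".wav", ".ogg", ".mp4", ".mov", ".avi",
  ".zip", ".gz", ".pdf", ".exe", ".dll",
]);

// Decide whether a file should go into the code index.
function shouldIndex(filePath: string): boolean {
  const dot = filePath.lastIndexOf(".");
  if (dot < 0) return true; // no extension (e.g. Makefile): index it
  const ext = filePath.slice(dot).toLowerCase();
  return !BINARY_EXTENSIONS.has(ext);
}
```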
Thank you for your reply!
> I currently just assume you want to index everything.
That is a valid approach. I don’t know why I was expecting that “everything” would mean only code files.
For testing, I excluded the images and rebuilt the index, and of course the images were about 95% of the time and 98% of the cost here.
If you come to believe it will be valuable, I propose only “including all” if requested and otherwise filtering out images, videos, and sound files, for example using a list like the one at ‘Image file format - Wikipedia’.
I’m adding logic to exclude images now (I’ll add sound & video extensions). I’m actually trying to get Codepilot to write the code. It wrote it, but it’s a bit heavy-handed, so I’m working with it to get it to do it more in the style that I use.
UPDATE:
I just published version 0.3.0 which adds logic to ignore indexing a wide range of binary files (let me know if an extension is missing) and it also adds the ability to create files.
There’s definitely an art to getting Codepilot to generate useful code (I see lots of areas where I can improve things), but once you land on the code you want, you can simply ask it to “write that to a file” and it will add a new file to your project. You then need to run `codepilot rebuild` to add it to the index, but I’ll fix that in the next version.
Modifying files is super tricky because the entire file may not fit into the model’s context window, and I don’t want to add any feature that does more harm than good. I already had to add logic to the `createFile()` function to ensure that the model only uses it to create files that don’t exist yet.
UPDATE:
Version 0.4.0 is published. File creation now adds any missing folders and adds the new file to the code index.
One area that needs further improvement is the logic for selecting the source code I show the model. You can ask Codepilot “show me the code for the createFile function” and, while the answer looks convincing, it’s not the correct code. The issue is that my Document Sections algorithm will sometimes render gaps in its output, and I’m not always showing the model whole methods/functions. This is where Vectra would really benefit from a secondary TF-IDF re-ranking pass, but I feel like I probably just need to update my Document Sections algorithm to favor more contiguous text spans over sparse ones. As I said, lots of room for improvement.
Great Codepilot moment… I’m working with it to write its own usage guide, and I was asking it to document how to add/remove source folders. It wrote a really good explanation, but you currently need to run `codepilot rebuild` anytime you change settings. I expected it to update the text, but instead it spit out the code for adding a new `watch` feature to Codepilot.
Because wouldn’t it just be better to add the ability to automatically re-index files?
I’m leaning towards just having Codepilot index everything in the repo by default and using a new `.llmignore` file to define the folders and files that get ignored. Thoughts?
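A `.llmignore` file along those lines could be handled with something as simple as the sketch below. The one-pattern-per-line, `#`-comment format is assumed here by analogy with `.gitignore`, and the matching is deliberately simplified; a real implementation would use proper glob matching (e.g. `minimatch`):

```typescript
// Parse a .llmignore file: one pattern per line, "#" comments and blank
// lines ignored (format assumed by analogy with .gitignore).
function parseLlmIgnore(text: string): string[] {
  return text
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0 && !line.startsWith("#"));
}

// Deliberately simplified matching: directory prefixes and "*.ext" globs only.
function isIgnored(filePath: string, patterns: string[]): boolean {
  return patterns.some((p) =>
    p.startsWith("*.")
      ? filePath.endsWith(p.slice(1))
      : filePath === p || filePath.startsWith(p.replace(/\/$/, "") + "/")
  );
}
```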
I believe that makes sense.
There are so many different reasons why folders or files should not be indexed, and it’s not possible to account for all of them in advance.
You could put a “nice to have” on your to-do list: informing the user about the tokens per folder, so that even less experienced users can make better decisions.
So, to clarify, it would actually mean expanding the initial indexing process by interacting with the user a bit more in-depth.
I’m thinking that when you run Codepilot on a repo, it could drop a default `.llmignore` file alongside the `.codepilot` folder it creates. Good idea on educating users about token counts and costs.
I was originally thinking about a more interactive first-run experience, but I just wanted to get Codepilot out the door, so I got a little lazy yesterday.
I could actually run an assessment that estimates the token count for the index… I know very accurately the number of tokens I’m about to index.
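As a sketch, a quick pre-index estimate could use the common ~4 characters-per-token rule of thumb for English text and code; a real implementation would count with the model’s actual tokenizer (e.g. tiktoken) instead. Both function names here are illustrative:

```typescript
import * as fs from "fs/promises";
import * as path from "path";

// Rough estimate: ~4 characters per token is a common rule of thumb for
// English text and code. A real implementation would use the model's tokenizer.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

// Walk a folder tree and sum the estimated token counts of all files,
// so the user can see per-folder cost before indexing.
async function estimateFolderTokens(folder: string): Promise<number> {
  let total = 0;
  for (const entry of await fs.readdir(folder, { withFileTypes: true })) {
    const full = path.join(folder, entry.name);
    if (entry.isDirectory()) {
      total += await estimateFolderTokens(full);
    } else {
      total += estimateTokens(await fs.readFile(full, "utf-8"));
    }
  }
  return total;
}
```

Multiplying the total by the embedding model’s per-token price then gives a cost estimate the user can approve before the index is built.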
Yet another cool addition to the toolbelt, nice!
Maybe just a simple .llm file that is JSON/YAML for instructions? Ignores, instructions, etc.? And then it can continuously be added to and not be confined to “ignores”?
Lately all of my coding has been in ReactJS and so I keep my components encapsulated (excluding providers). I would like to be able to set the focus. So if I am working on Component B, it knows to only look in Component B-related material that I have described.
Admittedly I haven’t used this & maybe your .codepilot already does this.
I almost exclusively use Copilot to autofill tedious things for me, like console messages. They just rolled out (for me) the “Fix This” feature, and it just… kind of stinks, for me.
I am looking forward to the VSCode extension!
Thanks! Copilot is an awesome tool, I love it. I’m just trying to see if I can’t level up things with Codepilot. It has a ways to go… I’m trying to use it to write itself as that’s the ultimate test in my mind. I just caught it adding files that I didn’t want it to add so like I said, it has a ways to go.
There’s definitely a bit of riffing that you have to do with Codepilot but I already find that it’s sparking new ideas in me for how to approach problems. It’s definitely not there yet but it seems very promising.
How do you handle the finite context size of the LLM?
By chunking the code (along logical boundaries) and providing a code index and map (multi-level summaries), right? Is that enough to let the LLM do any real work on projects much longer than its context size?
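For what it’s worth, “chunking along logical boundaries” can be sketched naively like this; a real tool would use a proper parser (e.g. the TypeScript compiler API) rather than a regex, and the boundary detection here is only illustrative:

```typescript
// Split source at top-level function/class boundaries so each chunk stays
// within a size budget while respecting logical units.
function chunkByBoundaries(source: string, maxChars: number): string[] {
  // Very rough boundary detection via lookahead; a parser would do this properly.
  const parts = source.split(/\n(?=(?:export\s+)?(?:function|class)\s)/);
  const chunks: string[] = [];
  let current = "";
  for (const part of parts) {
    if (current && current.length + part.length > maxChars) {
      chunks.push(current); // budget exceeded: start a new chunk
      current = part;
    } else {
      current += (current ? "\n" : "") + part;
    }
  }
  if (current) chunks.push(current);
  return chunks;
}
```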
(I’m the author of the AskYourCode ChatGPT plugin)