GPT + vector DB no good for understanding *new* code bases

Regarding getting GPT-4 to understand large, new code bases, there seems to be a fundamental problem with this task due to the 8,192-token limit of GPT-4 (8k). That is enough for maybe a couple of classes of code, but that’s it. I’m using FAISS as an in-memory vector DB to store the code base, but that doesn’t help at all. Yes, searches are performed for relevant snippets, which are added to the context and passed in the prompt, but that’s not nearly good enough, as the model still fundamentally can’t have an understanding of the entire code base all at once. WTF? The whole “use a vector DB to understand large corpuses of new information” idea is fundamentally flawed. This technique is only good for quick, simple fact-based queries where the context fits into 8k tokens :roll_eyes: Is there a “trick” for this use case? (I can think of a few possible “solutions,” but they are only marginally better while being a lot more trouble.)
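For readers unfamiliar with the flow being complained about, here is a toy sketch of it. This is not real FAISS or real embeddings; it swaps both for a hash-free bag-of-words similarity purely to show the mechanics: only the top-k retrieved snippets ever reach the prompt, and everything below rank k is invisible to the model.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector (stand-in for a real model)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(chunks, query, k=2):
    """Return the top-k chunks by similarity (the FAISS-style search step).
    Everything below rank k never reaches the model's context window."""
    scored = sorted(chunks, key=lambda c: cosine(embed(c), embed(query)), reverse=True)
    return scored[:k]

# Hypothetical snippets standing in for an indexed code base.
chunks = [
    "class AgentExecutor: runs the agent loop",
    "def parse_config(path): loads YAML settings",
    "class ZeroShotAgent: an agent that picks tools",
]
top = retrieve(chunks, "how do agent classes work", k=2)
# Only these top-k snippets get pasted into the 8k prompt; the rest of the
# repo is invisible to the model, which is exactly the limitation above.
```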

Wow those are some big monoliths!

At this point, either break the code into smaller chunks before you embed, or wait for GPT-4 32k.
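On the “break the code into smaller chunks before you embed” suggestion: a common approach is fixed-size line windows with overlap, so a definition that straddles a chunk boundary still appears whole in at least one chunk. A rough sketch (the window and overlap sizes are arbitrary choices, not recommendations):

```python
def chunk_lines(source: str, window: int = 40, overlap: int = 10):
    """Split source into overlapping line windows before embedding.
    The overlap keeps a function that straddles a boundary intact
    in at least one chunk."""
    lines = source.splitlines()
    step = window - overlap
    return ["\n".join(lines[i:i + window]) for i in range(0, max(len(lines), 1), step)]

# Demo on a synthetic 100-line "file":
demo = "\n".join(f"line {i}" for i in range(100))
chunks = chunk_lines(demo)  # windows start at lines 0, 30, 60, 90
```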

Or think of your problem at much smaller, bite-sized levels.

I’m curious why you need so much context …

I unfortunately didn’t have the foresight myself, but a simple mental experiment reveals that this entire concept of using a vector database to ask various questions about a code base is not going to work.

If you want to ask some dumb-dumb stuff that you can easily look up, like what the required parameters to a particular class constructor are, sure, that will work.

However, if you want to actually ask useful things that you can’t look up in 2 seconds, the kind you would ask somebody who’s very experienced in that particular code base, then you can’t, because the model will never have the context needed. A very simple example of this type of query would be, “How many agent classes are there in the langchain repo?” (assuming your vector database was created from the langchain repo source files). More sophisticated general questions that require the context of the entire repo would be something like, “Given my code below, what are some issues and improvements with it?” or “What is the code path of parameter X as it enters the input of API Y and passes through the control plane?” There are so many general, useful questions that could be asked about a code base GPT-4 is not aware of, and they require context of the entire code base.


Have you tried using a discriminator model to validate certain context after the embedding?

Based on what you are trying to gather, which is high level questions and interactions within the codebase, I agree that embeddings alone are NOT going to help.

The closest thing you could try is using an auto-documentation engine such as Doxygen. This will build a static webpage containing a high-level diagram of the interactions, with the ability to click around and “surf” via HTML links to the various related pieces of code.

The fundamental reason it won’t work for you is that the AI is not graph-structured. It is structured more like a stochastic parrot, basically a fancy auto-completion engine. So there is no “gozinta/gozouta” (goes-into/goes-out-of) concept, which is important for code. But Doxygen (or similar) does map this, and gives you all the correct context.
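For anyone trying this, the relevant Doxyfile switches for the call-graph view are roughly the following (the dot graphs require Graphviz to be installed; paths here are placeholders):

```
# Doxyfile fragment -- generate browsable HTML with call graphs
INPUT            = ./src       # placeholder source directory
RECURSIVE        = YES
EXTRACT_ALL      = YES         # document everything, even uncommented code
GENERATE_HTML    = YES
HAVE_DOT         = YES         # needs Graphviz
CALL_GRAPH       = YES         # what each function calls ("gozoutas")
CALLER_GRAPH     = YES         # what calls each function ("gozintas")
```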

The “use a vector DB to understand large corpuses of new information” advice is more in the context of text/words, not code, given the current situation of how these LLMs work.


You can use a map-reduce search with a tool like langchain. I tried it recently; it breaks things up and goes piece by piece through documents from the vector store I am using. But I do always eventually run out of context when it comes to the final steps.
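For context, the map-reduce pattern mentioned here is roughly: summarize each retrieved chunk independently (map), then combine the summaries and answer from the combination (reduce). A schematic version, with a placeholder `llm()` standing in for a real model call (it just echoes the first line of the prompt so the control flow is runnable):

```python
def llm(prompt: str) -> str:
    """Placeholder for a real model call (e.g. a chat-completion request).
    Here it echoes the prompt's first line so the sketch runs end to end."""
    return prompt.splitlines()[0][:80]

def map_reduce_answer(question: str, chunks: list[str]) -> str:
    # Map: compress each chunk independently, so no single prompt
    # ever has to hold the whole corpus at once.
    summaries = [llm(f"Summarize for '{question}':\n{c}") for c in chunks]
    # Reduce: answer from the concatenated summaries. This final prompt is
    # where you can still run out of context if the summaries are too long,
    # which matches the failure described above.
    combined = "\n".join(summaries)
    return llm(f"Answer '{question}' using:\n{combined}")
```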

I totally understand your concerns about the limitations of GPT models, particularly GPT-4, when it comes to understanding and reasoning about large codebases due to the token limit constraint. I have been working on a project for a while now that integrates the GPT models and the embeddings API to provide more advanced developer tooling, code generation, and code editing. While it’s true that providing the model with complete context for sufficiently large codebases using single-shot methods is not feasible, there are alternative approaches that can help overcome this limitation.

One such approach I have been having success with involves using multi-agent setups and reflection techniques to break down the task into multiple prompts. The goal is to iteratively aggregate and compress more and more relevant information until the model has sufficient context to answer the original query. This process is analogous to how a developer learns a new large codebase, processing it one file at a time and slowly accumulating more and more relevant context to sufficiently solve a given problem at hand.

Here’s a high-level overview of how this approach works in practice:

  1. Divide the task: Have the model break down the original query into smaller, manageable sub-tasks that can be addressed within the token limit constraints.
  2. Iterative querying: Have the GPT model generate questions that it thinks will be helpful for each sub-task, gradually building up context and knowledge about the codebase.
  3. Reflection: Have a separate GPT agent evaluate the generated responses to identify gaps in understanding or areas where further clarification is needed.
  4. Aggregation and compression: Combine the gathered information from previous steps, having GPT filter out irrelevant details and compress the context to fit within the token limit.
  5. Final response: After going through the process above for a few loops, once the model feels sufficient context has been accumulated, have the GPT model generate a comprehensive response to the original query.
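The five steps above can be sketched as a loop. Everything here is a placeholder: `llm()` stands in for a real chat-completion call, `retrieve` for a vector-store lookup, and the prompts are illustrative, not the poster’s actual ones.

```python
def llm(prompt: str) -> str:
    """Placeholder model call; swap in a real chat-completion request.
    This stub answers "DONE" to the reflection check so the loop terminates."""
    return "DONE" if "enough context" in prompt else f"notes({len(prompt)})"

def answer_with_reflection(query: str, retrieve, max_loops: int = 5) -> str:
    context = ""
    for _ in range(max_loops):
        # Steps 1-2. Divide & iteratively query: ask the model what it still
        # needs, then fetch matching snippets from the vector store.
        question = llm(f"To answer '{query}', what should we look up next?\n{context}")
        snippets = retrieve(question)
        # Step 4. Aggregate & compress: fold the new snippets into a summary
        # that stays under the token budget.
        context = llm(f"Compress into context:\n{context}\n{snippets}")
        # Step 3. Reflection: a separate check on whether we can stop.
        if llm(f"Is this enough context to answer '{query}'?\n{context}") == "DONE":
            break
    # Step 5. Final response from the accumulated, compressed context.
    return llm(f"Answer '{query}' given:\n{context}")
```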

This approach requires careful planning and execution, but it can help address the challenges posed by the token limit constraint while still leveraging the power of GPT models for understanding and reasoning about complex, large codebases and producing code.

It’s important to note that this method may not be perfect and could still have limitations, but it offers a more viable solution compared to single-shot methods. The variety and quality of the vector snippets is really important. I parse my entire codebase as well as the code bases of project dependencies, and traverse the AST to provide varying scopes of code snippets as well as documentation. Obviously this introduces other problems, like greater cost, since multiple API trips are required. Hopefully, as GPT models continue to evolve and improve, we can expect better performance and capabilities in handling larger prompts and more complex queries.
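On the AST-traversal point: in Python the standard `ast` module makes this straightforward. A sketch (not the poster’s actual pipeline) that pulls out every class and function as a separately embeddable snippet at definition granularity, rather than arbitrary line windows:

```python
import ast

def code_snippets(source: str):
    """Walk the AST and yield (kind, name, source_text) for each class and
    function, so the embedder gets snippets scoped to whole definitions."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            kind = "class" if isinstance(node, ast.ClassDef) else "function"
            # get_source_segment recovers the exact source text for the node.
            yield kind, node.name, ast.get_source_segment(source, node)

demo = """
class Greeter:
    def greet(self, name):
        return f"hi {name}"

def main():
    print(Greeter().greet("world"))
"""
snippets = list(code_snippets(demo))
```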


Thanks for the detailed response @dakotamurphyucf .

I actually use steps one, two, four, and five for another project. Reflection is a good idea. It would still seem to mostly just work on a certain subset of question types about a large code base. Broad realizations, like “Oh, these classes should be wrapped up and put under an interface” (and lots of other examples that require simultaneous context of the whole code base), would still be out of context. I’ll play around with it and see if I come up with any ways around the issue.

Isn’t FAISS a vector index rather than a true database (hence your problem)?

One of the big mistakes people keep making is thinking that GPT can think. It really can’t. Remember, it’s a probabilistic model. It’s good at following instructions and it’s good at regurgitating information, but not thinking.

Any reasoning and thinking comes from the instructions and the natural flow of language itself. So, you might have to make a series of instructions that tells the AI how to think about the user request.

I’ve noticed this specifically with code and other complex tasks that involve novel logic or reasoning scenarios. So, you might have to break tasks down into more common instructions first.


There’s a rumor about a future version being able to read up to a million tokens of context. I don’t know if that’s true, but certainly we will see bigger context windows sooner or later. I think there is a lot of room for algorithmic and big-O improvements in that area.