Large codebase as knowledge for GPT-4-Turbo

I want to build a chatbot that can reason over a fairly large codebase for a specific software system and answer support or developer-related questions with the entire source code as context. Given the current context length limits, what is the best way to build this? Is it still a vector database that pulls the relevant parts into the prompt, or are there better options? I can use the GPT-4 Turbo model for this project.

Hi and welcome to the Developer Forum!

To do this you could either use GPTs, which are part of ChatGPT and require no code, or build an assistant, which allows more customisation and flexibility.

I am fine with building an assistant. What kind of size of knowledgebase (in this case source code) can I expect the assistant to be able to access effectively? What limits are there?

There is a total file size limit of 10 GB and a maximum of 20 files.

I have not tested how effective retrieval is on code, but it uses the latest Microsoft AI search, so hopefully you get good results in testing.

I was curious about your experience in building an assistant. Did you find the exercise to be helpful? What size of codebase did you target in terms of lines of code?
I am looking to build something similar and would like to augment the assistant with RAG on the documentation for the codebase.

I ended up using the LlamaIndex Python module and created three different indexes: one based on the source code, one based on the system documentation, and one based on an API schema definition (GraphQL). I then sent three parallel queries, one per index, followed by a fourth query to GPT-4 that combined the three responses with a prompt to summarize all of it. It seemed to work pretty well, but the latency was pretty high. I never evaluated how well it worked for a wider range of queries; this was just a quick test. The codebase was around 70 kLOC of Kotlin.
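
For anyone who wants to try the same shape, here is a minimal sketch of the fan-out and fan-in step described above. It is not my actual code and it does not call LlamaIndex or OpenAI: `make_stub_engine`, `synthesize`, `answer`, and `ENGINES` are hypothetical names, and the stubs stand in for the real per-index query engines and for the final GPT-4 call. Swap in real calls where the comments say so.

```python
import asyncio
from typing import Awaitable, Callable, Dict

# A query engine here is any async function: question -> answer text.
QueryFn = Callable[[str], Awaitable[str]]


def make_stub_engine(name: str) -> QueryFn:
    """Hypothetical stand-in for a real per-index query engine."""

    async def query(question: str) -> str:
        await asyncio.sleep(0.01)  # pretend this is a retrieval + LLM round trip
        return f"[{name}] answer to: {question}"

    return query


async def synthesize(question: str, partials: Dict[str, str]) -> str:
    """Hypothetical stand-in for the fourth call that merges the answers.

    A real version would send this prompt to the model and return its reply;
    the stub just returns the prompt so the flow can be inspected.
    """
    sections = "\n".join(f"{name}: {text}" for name, text in partials.items())
    return (
        f"Question: {question}\n"
        f"Summarize these partial answers into one reply:\n{sections}"
    )


async def answer(question: str, engines: Dict[str, QueryFn]) -> str:
    names = list(engines)
    # Fan out: the per-index queries run concurrently, not one after another.
    results = await asyncio.gather(*(engines[n](question) for n in names))
    # Fan in: one final call that combines the partial answers.
    return await synthesize(question, dict(zip(names, results)))


ENGINES: Dict[str, QueryFn] = {
    "source_code": make_stub_engine("source_code"),
    "documentation": make_stub_engine("documentation"),
    "api_schema": make_stub_engine("api_schema"),
}

if __name__ == "__main__":
    print(asyncio.run(answer("How is authentication handled?", ENGINES)))
```

With this layout the wait is roughly the slowest of the three index queries plus the final combining call, since that last call cannot start until all three have returned. That sequential last step is one likely contributor to the latency.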

Hey Ragnar,

That sounds awesome!
I assume the final response was a text summarising the codebase. It’s good that it had an API. Apart from chatting with the codebase, what could be some use cases of this RAG pipeline? Could you use this to generate a new feature in the codebase?

I was researching the same thing and found a good chunking approach for code by sweep dev. You should check out their docs.
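
To give a rough idea of what syntax-aware chunking means in general, here is a small sketch. It is not their implementation, just my own illustration: it uses Python's built-in `ast` module as a stand-in parser, so it only handles Python source (a Kotlin codebase would need a parser for Kotlin), and `chunk_python_source` and `max_chars` are names I made up. The point is that chunks are cut at statement boundaries instead of at arbitrary character offsets.

```python
import ast
from typing import List


def chunk_python_source(source: str, max_chars: int = 1500) -> List[str]:
    """Pack whole top-level statements into chunks of roughly max_chars.

    Assumptions of this sketch: Python 3.8+ (for end_lineno), a single
    statement larger than max_chars is kept whole as its own chunk, and
    comments or blank lines between top-level statements are dropped.
    """
    lines = source.splitlines()
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0

    for node in ast.parse(source).body:
        # Start at the first decorator, if any, so it stays with its def/class.
        decorators = getattr(node, "decorator_list", [])
        start = min([node.lineno] + [d.lineno for d in decorators])
        segment = "\n".join(lines[start - 1 : node.end_lineno])

        # Close the current chunk if adding this statement would overflow it.
        if current and current_len + len(segment) > max_chars:
            chunks.append("\n".join(current))
            current, current_len = [], 0

        current.append(segment)
        current_len += len(segment) + 1  # +1 for the joining newline

    if current:
        chunks.append("\n".join(current))
    return chunks


if __name__ == "__main__":
    sample = (
        "import os\n\n"
        "def a():\n    return 1\n\n"
        "class B:\n    def m(self):\n        return 2\n"
    )
    for i, chunk in enumerate(chunk_python_source(sample, max_chars=30)):
        print(f"--- chunk {i} ---\n{chunk}")
```

Because every chunk is made of complete statements, each one still parses on its own, which tends to give the embedding model more coherent units than fixed-size splits.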