Large codebase as knowledge for GPT-4-Turbo

rrva · November 16, 2023, 4:01pm

I want to build a chatbot that can reason over a fairly large codebase for a specific software system and answer support or developer-related questions with the entire source code as context. With the current context length limits, what is the best way to build this? Is it still using a vector database to pull relevant parts into the prompt? Or are there other, better options? I can use the GPT-4 Turbo model for this project.

Foxalabs · November 16, 2023, 4:05pm

Hi and welcome to the Developer Forum!

To do this you could either use GPTs, which are part of ChatGPT's no-code system, or build an assistant, which allows more customisation and flexibility.

https://platform.openai.com/docs/assistants/how-it-works/run-lifecycle

rrva · November 16, 2023, 4:12pm

I am fine with building an assistant. How large a knowledge base (in this case source code) can I expect the assistant to be able to access effectively? What limits are there?

Foxalabs · November 16, 2023, 4:17pm

There is a 10 GB total file size limit and a maximum of 20 files.

I have not tested the effectiveness of retrieval on code, but this is using the latest Microsoft AI search… hopefully you get good results in testing.

romit.chakraborty · February 2, 2024, 7:25pm

I was curious about your experience in building an assistant. Did you find the exercise to be helpful? What size of codebase did you target in terms of lines of code?
I am looking to build something similar and would like to augment the assistant with RAG on the documentation for the codebase.

rrva · February 2, 2024, 10:04pm

I ended up using the LlamaIndex Python module, where I created three different indexes: one based on the source code, one based on the system documentation, and one based on an API schema definition (GraphQL). I then sent three parallel queries, one against each index, followed by a fourth query to GPT-4 that combined the three responses with a prompt to summarize all of it. It seemed to work pretty well, but the latency was pretty high. I never evaluated how well it worked for a wider range of queries; this was just a quick test. The codebase was around 70 kLOC of Kotlin.
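A minimal sketch of that fan-out-and-summarize pattern, using only the Python standard library. Everything named here is an assumption made for illustration: `make_stub_engine` stands in for a real index-backed query engine (for example one built with LlamaIndex), and the `summarize` callable stands in for the final GPT-4 request. It is not the code from the test described above.

```python
from concurrent.futures import ThreadPoolExecutor


def make_stub_engine(name):
    """Hypothetical stand-in for an index-backed query engine."""
    def query(question):
        return f"[{name}] answer to: {question}"
    return query


# One engine per index: source code, system documentation, GraphQL schema.
ENGINES = {
    "source_code": make_stub_engine("source_code"),
    "system_docs": make_stub_engine("system_docs"),
    "graphql_schema": make_stub_engine("graphql_schema"),
}


def fan_out(question, engines):
    """Send the same question to every engine in parallel."""
    with ThreadPoolExecutor(max_workers=len(engines)) as pool:
        futures = {name: pool.submit(engine, question)
                   for name, engine in engines.items()}
        return {name: future.result() for name, future in futures.items()}


def build_summary_prompt(question, partial_answers):
    """Combine the per-index answers into one prompt for the final model call."""
    sections = "\n\n".join(f"### {name}\n{answer}"
                           for name, answer in partial_answers.items())
    return (
        f"Question: {question}\n\n"
        f"Answers retrieved from separate indexes:\n\n{sections}\n\n"
        "Summarize these into a single consistent answer."
    )


def answer(question, engines, summarize):
    """Run the parallel queries, then one summarizing call (the fourth query)."""
    partial_answers = fan_out(question, engines)
    return summarize(build_summary_prompt(question, partial_answers))
```

Because the final call cannot start until the slowest of the three index queries returns, end-to-end latency is roughly the slowest index query plus the summarizing call, which fits the high latency noted above.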

romit.chakraborty · February 8, 2024, 5:37pm

Hey Ragnar,

That sounds awesome!
I assume the final response was a text summarising the codebase. It’s good that it had an API. Apart from chatting with the codebase, what could be some use cases of this RAG pipeline? Could you use this to generate a new feature in the codebase?

lakshyadhariwal · February 17, 2024, 1:24pm

I have been researching the same thing and found good chunking logic for code by sweep dev. You should check out their docs.
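To illustrate what syntax-aware chunking of code can look like, here is a simplified sketch. It is not the chunker referred to above: it handles only Python source (via the standard `ast` module, Python 3.8+), splits only at top-level statements, and the function name `chunk_python_source` and the `max_chars` default are assumptions chosen for the example. A Kotlin codebase would need a Kotlin-capable parser instead.

```python
import ast


def chunk_python_source(source, max_chars=1500):
    """Group top-level statements into chunks of at most roughly max_chars.

    A single statement longer than max_chars is kept whole, so a function or
    class is never cut in half. Comments and blank lines between top-level
    statements are dropped in this simplified version.
    """
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    current = ""
    for node in tree.body:
        start = node.lineno - 1
        # Decorators sit above the def/class line, so include them.
        decorators = getattr(node, "decorator_list", [])
        if decorators:
            start = min(d.lineno for d in decorators) - 1
        text = "\n".join(lines[start:node.end_lineno])
        if current and len(current) + len(text) + 1 > max_chars:
            # Adding this statement would overflow the chunk: start a new one.
            chunks.append(current)
            current = text
        else:
            current = f"{current}\n{text}" if current else text
    if current:
        chunks.append(current)
    return chunks
```

Each resulting chunk can then be embedded and indexed on its own, so that retrieval returns whole definitions rather than arbitrary slices of a file.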
