Would using embeddings like this work in theory?

OpenAI actually has an example of code-search using embeddings in Python. You could mimic this to pull in all your code and embed it.

Then you would feed the relevant content to the LLM, using a standard RAG approach, like this other cookbook:

Note that you need to give the LLM an out, and allow it to respond appropriately if the retrieved information is not relevant to the question. So allow it to say “I could not find an answer.” See cookbook for an example of how to do this.

The other feedback would be your mod 2 implementation of assistant/user pairs. Sometimes the user can send two or more questions before the next assistant, so you may want to be more explicit, and pull the assistant/user streams sorted by timestamp, and not use interleaving assumptions … just to be safe.

The vectors and text you embed would be in a database, and you would store the hash of the text as the key into the database, and search the vectors as a linear binary search using python and numpy for the best performance. Then get the hashes from the top vector matches (positionally) and index into the DB to get the text.

However, you can even ditch the DB and have all text and vectors in separate arrays if you have enough memory (works for most small RAGs and is the fastest option most likely).

I have found vectorized code in numpy isn’t needed for search, just simple for-loop linear is fastest, but feel free to benchmark different implementations to see what is fastest in your environment.