I’m sure most of us have seen multiple videos or read various blogs anointing the new OpenAI Assistants API (AA) as the “RAG Killer”. But I say, “Not so fast!”
I have been trying to keep abreast of developers’ progress with the AA, and I keep seeing the same request (in one form or another): How do I add more files?
That got me thinking: how can we work around the current file restrictions in the AA? The first thing that comes to mind is to use RAG with the AA – that is, send your prompt to a vector store, retrieve the relevant context, and send that context along with the question to the AA. The context then becomes part of the prompt history.
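A minimal sketch of that workaround, assuming a generic vector store behind a placeholder `retrieve_context` function (everything here is illustrative; a real implementation would embed the question and query Pinecone, Chroma, or similar):

```python
# Workaround sketch: retrieve context from a vector store and fold it into
# the user message itself before sending it to the Assistant. Because the
# context lives inside the message text, it lands in the thread history.

def retrieve_context(question: str, top_k: int = 3) -> list:
    # Placeholder retrieval: a real version would do an embedding similarity
    # search. The tiny corpus below is purely for illustration.
    corpus = {
        "What is our refund window?": ["Refunds are accepted within 30 days."],
    }
    return corpus.get(question, [])[:top_k]

def build_assistant_message(question: str) -> dict:
    context = "\n".join(retrieve_context(question))
    # Context and question are concatenated into one prompt string -- this
    # is exactly what makes the thread history grow expensive over time.
    return {
        "role": "user",
        "content": f"Context:\n{context}\n\nQuestion: {question}",
    }

msg = build_assistant_message("What is our refund window?")
```

The key point: the retrieved context is indistinguishable from the rest of the prompt, so the thread keeps (and re-bills) it on every subsequent turn.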
The good news is that the model will now retain the “context” of your discussion “thread” so far. The bad news is that if you have a follow-up question, the model is going to charge you tokens for the new question and context, plus charge you again for each previous question and context which, as I understand it, is maintained in the thread up to a certain point. This will get wildly expensive very quickly.
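Some rough back-of-the-envelope arithmetic shows how fast this compounds. The token counts below are assumptions for illustration only (and they ignore model responses, which also accumulate), not OpenAI pricing:

```python
# Rough cost model: if every turn's retrieved context stays in the thread,
# turn n re-sends the question + context of turns 1..n as input tokens.

CONTEXT_TOKENS = 1500   # assumed tokens of retrieved context per turn
QUESTION_TOKENS = 50    # assumed tokens of user question per turn

def input_tokens_for_turn(n: int) -> int:
    # Turn n pays for its own question + context AND every prior turn's.
    return n * (CONTEXT_TOKENS + QUESTION_TOKENS)

# Cumulative input tokens billed across a 5-turn conversation.
total = sum(input_tokens_for_turn(n) for n in range(1, 6))
```

With these assumptions a 5-turn chat bills 23,250 input tokens, versus 7,750 if each turn only paid for its own question and context – triple the cost, and the gap widens every turn.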
So, it seems like the thing that would make this work is one more field in the message payload: “context”. Follow me:
You have as input:
- User question
- File(s), if any
I’m saying: suppose we could add an additional “Context” field – a field where we could put any additional information we want the model to consider along with the user prompt and file(s), if any.
So now, no matter how many files you have in your dataset, you can send just the relevant context to the model, and it will have all the information it needs to compose an answer.
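To make the proposal concrete, here is a sketch of what such a message payload might look like. To be clear, the `context` field (and this payload shape) is hypothetical – it does not exist in the Assistants API today:

```python
from typing import Optional

# Hypothetical message payload with a dedicated "context" field, kept
# separate from the user prompt and any file attachments.

def build_proposed_message(question: str, context_chunks: list,
                           file_ids: Optional[list] = None) -> dict:
    message = {
        "role": "user",
        "content": question,                   # the user prompt, as today
        "context": "\n".join(context_chunks),  # proposed: retrieved context,
                                               # considered but kept separable
    }
    if file_ids:
        message["file_ids"] = file_ids         # file(s), if any
    return message

msg = build_proposed_message(
    "What is our refund window?",
    ["Refunds are accepted within 30 days."],
)
```

Because the context lives in its own field rather than inside `content`, the API could treat it differently from the rest of the message – which is exactly what the next step relies on.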
But, you ask, what about the thread history and the massive number of tokens caused by the context?
I say give us the option of storing ONLY the user prompt and model response as history, and NOT the context, in the thread memory. We don’t need the previous context for the next question because we already have the model response. And whatever new context we need for the next question, we can get in a new vector store retrieval. Basically, it works like RAG, only we make the AA more flexible (and less expensive) by telling it what NOT to store (and bill additional tokens for).
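Here is a sketch of that storage option. The `store_context` flag and the history-handling logic are hypothetical – this is the behavior being proposed, not an existing API feature:

```python
# Proposed behavior: after each turn, only the user prompt and the model
# response are appended to thread history; the retrieved context is dropped
# unless the caller opts in to storing it.

def append_turn(history: list, question: str, context: str,
                response: str, store_context: bool = False) -> None:
    if store_context:
        # Today's behavior: context rides along inside the stored prompt.
        question = f"Context:\n{context}\n\nQuestion: {question}"
    history.append({"role": "user", "content": question})
    history.append({"role": "assistant", "content": response})

history = []
append_turn(history, "What is our refund window?",
            "Refunds are accepted within 30 days.",
            "You can request a refund within 30 days of purchase.")
# Follow-up questions trigger a fresh vector store retrieval; the old
# context never accumulates in the history you pay input tokens for.
```

With `store_context=False`, each follow-up turn pays only for the compact question/response history plus one fresh batch of context, instead of every batch of context ever retrieved.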
To recap, we now send the model:
- User question
- Context (retrieved from your vector store)
- File(s), if any
It processes the information and returns a response. Everything is stored in the thread “memory” except the context – or better yet, make storing the context an option in the AA call.
This allows us to use RAG with AA and overcome the file limitations without incurring massive token usage costs. Best of both worlds!
What do you think?