Assistants API and RAG - Best of Both Worlds?

I’m sure most of us have seen multiple videos or read various blogs anointing the new OpenAI Assistants API (AA) as the “RAG Killer”. But I say, “Not so fast!”

I have been trying to keep abreast of developers progress in using the AA and I keep seeing the same request (in one form or another): How do I add more files?

I start thinking: How can we bypass the current file restrictions in AA? The first thing that comes to mind, is to use RAG with AA – that is, send your prompt to a vector store, retrieve the context, and send this context along with the question to the AA. This then becomes part of the prompt history.

The good news is that the model will now retain the “context” of your discussion “thread” so far. The bad news is that if you have a follow up question, the model is going to charge you tokens for the new question and context, plus charge you again for the previous question and context which, as I understand it, it is maintained in the thread up until a certain point. This will get wildly expensive very quickly.

So, it seems like the thing that would make this work is one more field in the message payload: “context”. Follow me:

You have as input:

  • Instructions
  • User question
  • File(s)

I’m saying, suppose we could add an additional “Context” field? A field where we could put additional information we want the model to consider along with the User prompt and file(s) if any.

So now, no matter how many files you have in your dataset, you can send just the relevant context to the model and it has all the information it needs to complete an answer.

But, you ask, what about the thread history and the massive number of tokens caused by the context?

I say give us the option of ONLY storing the user prompt and model response as history and NOT the context in the thread memory. We don’t need the previous context for the next question because we already have the model response. And, whatever new context we need for the next question, we can get in a new vector store retrieval. Basically, it works like RAG, only we make the AA more flexible (and less expensive) by telling it what NOT to store (and charge additional tokens on).

To recap, we now send the model:

  • Instructions
  • User question
  • File(s)
  • Context

It processes the information and returns a response. Everything is stored in the “memory” except the Context. I mean, make it an option in the AA call to store or remove.

This allows us to use RAG with AA and overcome the file limitations without incurring massive token usage costs. Best of both worlds!

What do you think?

5 Likes

@SomebodySysop, Yes that’s how i did it in the GPT and it works perfectly using the schema. However what I don’t know how to do is, how to convert my working GPT that successfully uses my own Flask via the GPT API, with a mix of files for instructions, and then using the RAG model. So I have no document limits for my ‘knowledge’, to make the Assistant work as a standalone AI Assistant and so I just want to now create and use a Python Flask to do what the GPT is doing with my API.

Therefore key to making this work, I need the Flask to be able to chain chat completions together in exactly the same way (memory) that the GPT does and response the same with text streaming. Is this possible, I assume it is but I have not seen single example using a REST API to provide context nor in RAG format which is how my Flask is currently working to return context to the GPT and that works really well.

I do not know anything about Flask. It looks like it’s some sort of no-code solution t linking API calls to a website.

Having built 3 RAG applications so far, I can say yes, not only is it possible but I would assume it is exactly what the vast majority of developers are doing today.

This never seems to get old: https://youtu.be/Ix9WIZpArm0?si=tKIb0RzffnU-3UPe

1 Like

Yep I’m with you. Yes I have used Flask now with RAG and it works great with davinci not GPT 3.5 turbo and using the latest openai 1.3.7 version and updated langchain for similarity search of a ChromaDB vectorstore and lama-hub for a data loader. Just 3 secs response with summarization of up to 5 sentences, vs around avg of 8 secs with GPT3.5 turbo for a specific q&a.

Some of this may be outdated now with V2 and Vector Stores. Sadly we can’t control chunking, overlap etc… but it seems to be the direction that OpenAI heads into.

Could you elaborate a bit on what you mean by this?

Which is why the concept of RAG, where you are able to control these and more, isn’t nearly as outdated as many may think.

There are those of us who are actively working on improving the embedding methodologies for RAG: Using gpt-4 API to Semantically Chunk Documents

That now we have a different schema for interacting with indexes created from our files. The new model is more flexible in that it allows you to create multiple, separate indexes and decide when do you want to enable one or more for retrieval. Though, still, they enforce a set chunking (I think 500 tokens) and overlap (50%). So building your retrieval separately is still a good idea for most use cases.

Absolutely!, RAG will be with us in different forms as much as the context grows. RAFT is the best solution at the moment to enjoy both short and long term memory.

1 Like

I was going to ask “What is RAFT?” https://www.datacamp.com/blog/what-is-raft-combining-rag-and-fine-tuning

I remember asking the question a year ago: Are there any advantages to combining RAG with fine-tuning? Now I have my answer.

1 Like