How do you do RAG without blowing up costs?

Hello everyone, I’m back with another question for the experts here. I recently heard about RAG and thought it would be perfect for a small agenda bot I’m trying to develop.

I’m not an AI expert, so feel free to deconstruct my reasoning if it’s wrong.

My idea is that RAG would let me provide a file containing thousands of upcoming events. I could then say to ChatGPT: “I like jazz music and outdoor places, and I live in Boston”, and ChatGPT would give me the events that best match my request.

The first problem I see is the cost. If I’ve understood correctly, my entire file or its chunks will be provided in the context of ChatGPT, which means that my input will have a lot of tokens. Let’s say that my event list is 32k tokens, which is the context limit of GPT-4 32k. The input price of GPT-4 32k is $0.06 / 1K tokens. This would mean that each request would cost me about $1.96 in input tokens alone.
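To make the arithmetic concrete, here is a rough sketch of the calculation. The price is the figure quoted in this post, not necessarily current pricing, and the 1,500-token figure for a handful of retrieved chunks is purely an assumption for comparison.

```python
# Price as quoted in the post above; check current pricing before relying on it.
PRICE_PER_1K_INPUT_TOKENS = 0.06  # USD, GPT-4 32k input


def input_cost(tokens: int, price_per_1k: float = PRICE_PER_1K_INPUT_TOKENS) -> float:
    """Return the cost in USD of sending `tokens` input tokens."""
    return tokens / 1000 * price_per_1k


# Whole event list stuffed into a full 32k (32,768-token) context.
full_context = input_cost(32_768)

# Hypothetical: only a few retrieved chunks, assumed to total 1,500 tokens.
retrieved_only = input_cost(1_500)

print(f"full context:   ${full_context:.2f}")
print(f"retrieved only: ${retrieved_only:.2f}")
```

The point is only the ratio: the cost scales with what you put in the prompt, not with the size of the file.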

I think there must be an error in my reasoning, otherwise RAG would be a very expensive technique.

Thanks for your help.

Yes, GPT-4 32k becomes very expensive if you just stuff the prompt with lots of irrelevant information. That’s why RAG comes in very handy.

The idea of RAG is that you only retrieve the chunks the LLM needs to produce a good answer. So in your case, you would create chunks in a logical way and use vector search (or just filters) to retrieve only the chunks that are needed. You would then feed GPT only the chunks that contain events in Boston labelled jazz and outdoor, and ask it to create a nice agenda.
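As a minimal sketch of the "just filters" variant: the `Event` fields, the sample events, and the prompt wording below are all made up for illustration, and the actual call to GPT is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    title: str
    city: str
    tags: frozenset


# Hypothetical sample data standing in for the real event file.
EVENTS = [
    Event("Jazz in the Park", "Boston", frozenset({"jazz", "outdoor"})),
    Event("Indoor Jazz Club Night", "Boston", frozenset({"jazz"})),
    Event("Open-Air Jazz Festival", "Chicago", frozenset({"jazz", "outdoor"})),
    Event("Harbor Sunset Jazz", "Boston", frozenset({"jazz", "outdoor"})),
]


def retrieve(events, city, wanted_tags, limit=5):
    """Keep only events in `city` that carry every wanted tag."""
    wanted = frozenset(wanted_tags)
    matches = [e for e in events if e.city == city and wanted <= e.tags]
    return matches[:limit]


def build_prompt(request, matches):
    """Build a prompt that contains only the retrieved events."""
    listing = "\n".join(f"- {e.title} ({e.city})" for e in matches)
    return (
        f"User request: {request}\n\n"
        f"Candidate events:\n{listing}\n\n"
        "Suggest an agenda using only these events."
    )


matches = retrieve(EVENTS, "Boston", {"jazz", "outdoor"})
prompt = build_prompt("I like jazz music and outdoor places, I live in Boston.", matches)
print(prompt)
```

Only the two matching Boston events end up in the prompt, so the token count depends on `limit`, not on how many events are in the file.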

There are other options, but in the end it is all about retrieving only the content relevant to an answer and feeding that to the beast.


First, you’d take the corpus of information, chunk it, and index it, and then you’d use retrieval to select a small subset of the information to provide in the prompt. That’s the whole point of “retrieval”: find the appropriate bits, don’t jam everything in there.
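For the chunking step, one simple approach (a sketch, not the only way) is to pack whole event lines into chunks under a size budget. The word count here is only a rough stand-in for tokens, and `max_words` and the sample lines are arbitrary assumptions.

```python
def chunk_events(lines, max_words=50):
    """Group event lines into chunks of at most `max_words` words each.

    A line is never split; a single line longer than the budget
    becomes its own chunk.
    """
    chunks, current, count = [], [], 0
    for line in lines:
        words = len(line.split())
        # Start a new chunk when adding this line would exceed the budget.
        if current and count + words > max_words:
            chunks.append("\n".join(current))
            current, count = [], 0
        current.append(line)
        count += words
    if current:
        chunks.append("\n".join(current))
    return chunks


# Hypothetical event lines.
lines = [
    "Jazz in the Park, Boston, outdoor",
    "Indoor Jazz Club Night, Boston",
    "Startup Meetup, Boston, tech networking",
]
chunks = chunk_events(lines, max_words=12)
for i, chunk in enumerate(chunks):
    print(f"--- chunk {i} ---\n{chunk}")
```

For an event list, one event per chunk is also a reasonable choice, since each event is already a self-contained unit.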

Second, it sounds like your use case isn’t really a GPT use case at all. Regular parametric search is much better done with a standard database, and if you want “similar to” type matches, you can use the ADA 002 text embedding model to generate embedding vectors and rank events by similarity score. This is the first half of RAG, the “retrieval”, but you don’t need to then jam the results through GPT-4, because you already have the things you want to return to the user.
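Here is a sketch of the similarity ranking on its own. The 3-dimensional vectors are hand-made toys (think "jazz", "outdoor", "tech"); in practice they would come from an embedding model and have many more dimensions. No API call is made here.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k(query_vec, index, k=2):
    """Return the titles of the k entries most similar to the query."""
    ranked = sorted(
        index,
        key=lambda item: cosine_similarity(query_vec, item[1]),
        reverse=True,
    )
    return [title for title, _ in ranked[:k]]


# Toy index: (title, made-up vector). Real vectors come from an embedding model.
index = [
    ("Jazz in the Park", [0.9, 0.8, 0.0]),
    ("Indoor Jazz Club Night", [0.9, 0.1, 0.0]),
    ("Startup Meetup", [0.0, 0.1, 0.9]),
]

query = [0.8, 0.9, 0.0]  # stands in for "jazz music and outdoor places"
results = top_k(query, index, k=2)
print(results)
```

The ranked titles can be returned to the user directly; passing them through GPT is optional and only needed if you want a written agenda.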


How should it be indexed? If the answer to the prompt is not in the first chunk, do I need to iterate over the next chunk, and so on? Is there a more efficient way to reduce the token cost?