How do you make RAG without blowing costs?

Genial · November 3, 2023, 8:25am

Hello everyone, I’m coming back to your expertise with another problem. I recently heard about RAG. I thought it would be perfect for a small Agenda bot I’m trying to develop.

I’m not an AI expert, so feel free to deconstruct my reasoning if it’s wrong.

My idea is that RAG would let me provide a file containing thousands of upcoming events. I could then say to ChatGPT: “I like jazz music and outdoor places, I live in Boston,” and ChatGPT would give me the events that best match my request.

The first problem I see is the cost. If I’ve understood correctly, my entire file or its chunks will be provided in the context of ChatGPT, which means that my input will have a lot of tokens. Let’s say that my event list is 32k tokens, which is the context limit of GPT-4 32k. The input price of GPT-4 32k is $0.06 / 1K tokens. This would mean that each request would cost me roughly $1.96 in input tokens alone.
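As a quick check of that arithmetic, here is a minimal sketch in Python. The helper name `prompt_cost_usd` is made up for illustration, and the price is simply the $0.06 / 1K figure quoted in the post. A “32k” context is read here as either 32,000 or 32,768 tokens, which is why the estimate lands in a small range around the figure above.

```python
def prompt_cost_usd(prompt_tokens: int, usd_per_1k_tokens: float) -> float:
    """Input cost of one request: tokens / 1000 * price per 1K tokens."""
    return prompt_tokens / 1000 * usd_per_1k_tokens


# Price as quoted in the post (USD per 1K input tokens); check current pricing before relying on it.
QUOTED_PRICE_PER_1K = 0.06

cost_32000 = prompt_cost_usd(32_000, QUOTED_PRICE_PER_1K)
cost_32768 = prompt_cost_usd(32_768, QUOTED_PRICE_PER_1K)

print(f"32,000 tokens: ${cost_32000:.2f} per request")
print(f"32,768 tokens: ${cost_32768:.2f} per request")
```

Either way the order of magnitude is the same: close to two dollars per request if the whole file goes into every prompt.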

I think there must be an error in my reasoning, otherwise RAG would be a very expensive technique.

Thanks for your help.

hessel · November 3, 2023, 9:50am

Yes, gpt-4 32K tends to become very expensive if you just stuff the prompt with lots of irrelevant information. That’s why RAG comes in very handy.

The idea of RAG is that you only retrieve the chunks the LLM needs to produce a good answer. So in your case, you would create chunks in a logical way and use vector search (or just filters) to retrieve only the chunks that are needed. You then feed the chunks that contain events in Boston labelled jazz and outdoor to GPT and ask it to create a nice agenda.
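The “just filters” variant might look roughly like the sketch below. Everything in it is an assumption for illustration: the `Event` fields, the sample events, and the prompt wording are invented, and no model is actually called. It only shows that the prompt ends up containing the matching events and nothing else.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    # Hypothetical schema; a real event file may look different.
    title: str
    city: str
    tags: frozenset


EVENTS = [
    Event("Jazz in the Park", "Boston", frozenset({"jazz", "outdoor"})),
    Event("Indoor Techno Night", "Boston", frozenset({"electronic", "indoor"})),
    Event("Riverside Jazz Picnic", "Boston", frozenset({"jazz", "outdoor"})),
    Event("Jazz Brunch", "Chicago", frozenset({"jazz", "indoor"})),
]


def filter_events(events, city, required_tags):
    """Keep events in the given city that carry every required tag."""
    wanted = set(required_tags)
    return [e for e in events if e.city == city and wanted <= e.tags]


def build_prompt(events, request):
    """Build a prompt that contains only the already-filtered events."""
    lines = "\n".join(
        f"- {e.title} ({e.city}; {', '.join(sorted(e.tags))})" for e in events
    )
    return (
        f"User request: {request}\n\n"
        f"Candidate events:\n{lines}\n\n"
        "Write a short agenda using only these events."
    )


matches = filter_events(EVENTS, "Boston", {"jazz", "outdoor"})
prompt = build_prompt(matches, "I like jazz music and outdoor places, I live in Boston.")
print(prompt)
```

The resulting prompt is a few dozen tokens instead of the whole event file, and that is where the saving comes from.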

There are other options, but in the end it is all about retrieving only the content that is relevant to the answer and feeding that to the beast.

jwatte · November 3, 2023, 2:45pm

First, you’d take the corpus of information, chunk it and index it, and then you’d use the retrieval step to select a small subset of the information to provide in the prompt. That’s the whole point of “retrieval”: find the appropriate bits, don’t jam everything in there.
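A rough sketch of “chunk it and index it”, under loudly stated assumptions: token counts are approximated by whitespace-separated words, the index is a plain keyword index standing in for a vector index, and all function names and sample lines are hypothetical.

```python
def chunk_events(event_lines, max_tokens_per_chunk):
    """Group lines into chunks whose approximate size stays within a budget.

    Tokens are approximated by word count, which is only an assumption.
    """
    chunks, current, used = [], [], 0
    for line in event_lines:
        size = len(line.split())
        if current and used + size > max_tokens_per_chunk:
            chunks.append(current)
            current, used = [], 0
        current.append(line)
        used += size
    if current:
        chunks.append(current)
    return chunks


def build_index(chunks):
    """Map each lowercase word to the ids of the chunks that contain it."""
    index = {}
    for chunk_id, chunk in enumerate(chunks):
        for line in chunk:
            for word in line.lower().split():
                index.setdefault(word.strip(".,;:"), set()).add(chunk_id)
    return index


def select_chunks(index, query_words):
    """Return ids of chunks that contain every query word."""
    hits = [index.get(w.lower(), set()) for w in query_words]
    return sorted(set.intersection(*hits)) if hits else []


lines = [
    "Boston outdoor jazz concert at the Esplanade",
    "Boston jazz jam session outdoor patio",
    "Boston indoor chess tournament downtown",
    "Chicago blues festival in Grant Park",
]
chunks = chunk_events(lines, max_tokens_per_chunk=14)
index = build_index(chunks)
selected = select_chunks(index, ["boston", "jazz", "outdoor"])
print(selected, [chunks[i] for i in selected])
```

Only the selected chunk would go into the prompt; the other chunk never reaches the model.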

Second, it sounds like your use case isn’t really a GPT use case at all. Regular parametric search is much better done with a standard database, and if you want “similar to” type matches, you can use the text-embedding-ada-002 embedding model to generate vectors and score their similarity. This is the first half of RAG, the “retrieval”, but you don’t need to then jam the results through GPT-4 because you already have the things you want to return to the user.
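To illustrate the retrieval-only idea, here is a minimal sketch that ranks events by cosine similarity and returns the top matches directly, with no LLM call. Note the big assumption: `toy_embed` is a bag-of-words stand-in so the example runs offline. In practice the vectors would come from a real embedding model such as the one mentioned above, and the sample events are invented.

```python
import math
from collections import Counter


def toy_embed(text):
    """Stand-in for a real embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query, documents, k):
    """Return the k documents most similar to the query, best first."""
    q = toy_embed(query)
    ranked = sorted(documents, key=lambda d: cosine(q, toy_embed(d)), reverse=True)
    return ranked[:k]


EVENTS = [
    "outdoor jazz concert in boston",
    "indoor cooking class in boston",
    "jazz festival outdoor stage in boston",
    "marathon training run in chicago",
]

results = top_k("jazz outdoor boston", EVENTS, k=2)
for event in results:
    print(event)
```

The ranked list is already the answer for the user, so the only per-request model cost is embedding the query, which is far cheaper than a long GPT-4 prompt.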

saurabhit1127 · January 9, 2024, 7:35am

How do I index it? If the answer to the prompt is not in the first chunk, I then need to iterate over the next chunk. Is there a more efficient way to reduce the token cost?

jiang.chen · February 28, 2025, 4:00am

There might be a lot of ways to reduce cost, but I guess the first step is to understand where the cost comes from. We built a convenient calculator to estimate the cost of different parts of building a RAG pipeline, including chunking, embedding, vector storage/search, and LLM generation. It helps you identify cost-saving opportunities.

Feel free to take a look!
