I am building a RAG chatbot using the File Search Assistant to answer questions based on files I uploaded (not user-uploaded files). I’ve noticed that handling a thread with 10 user questions costs around $0.20. This cost can escalate significantly with a large user base (e.g., 10,000 users with more than 10 messages each).
Here are the specifics of my current setup:
Model: GPT-4o
File search
max_num_results = 5
max_chunk_size_tokens = 800
chunk_overlap_tokens = 400
Size of vector store = 117 KB
Context length for GPT-4o = 128k tokens
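For reference, this setup corresponds roughly to something like the following (a sketch using the OpenAI Python SDK; the file name and instructions are placeholders, and the exact client paths can differ between SDK versions):

```python
from openai import OpenAI

client = OpenAI()

# Upload a knowledge file and create a vector store (placeholder file name)
kb_file = client.files.create(file=open("knowledge.pdf", "rb"), purpose="assistants")
vector_store = client.beta.vector_stores.create(name="kb")

# Attach the file with the static chunking strategy described above
client.beta.vector_stores.files.create(
    vector_store_id=vector_store.id,
    file_id=kb_file.id,
    chunking_strategy={
        "type": "static",
        "static": {"max_chunk_size_tokens": 800, "chunk_overlap_tokens": 400},
    },
)

# GPT-4o assistant limited to 5 retrieved chunks per file_search call
assistant = client.beta.assistants.create(
    model="gpt-4o",
    instructions="Answer only from the attached documents.",
    tools=[{"type": "file_search", "file_search": {"max_num_results": 5}}],
    tool_resources={"file_search": {"vector_store_ids": [vector_store.id]}},
)
```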
I understand that max_prompt_tokens and max_completion_tokens aren’t viable solutions as they halt generation when the token limit is reached.
My questions are:
Has anyone found effective strategies to reduce costs without sacrificing performance?
Is there a way to reduce the context length for the same model with smart truncation, without impacting performance? And would that be a good solution?
What are your thoughts on the cost of $0.20 for 10 user questions within a single thread?
Any advice, experiences, or alternative solutions would be greatly appreciated!
The only other parameter you have control over is truncation_strategy, which limits the number of past chat turns to something lower than the maximum (but not by tokens or cost).
With that parameter you can, for example, limit the AI's memory to only the last four messages. Alternatively, you might simply force the user to restart once they have their document's answer and the thread has ballooned from 1,000 to 20,000 tokens (with appropriate framing of why the user must restart).
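A minimal sketch of that parameter on a run, assuming the standard Assistants API (IDs are placeholders):

```python
from openai import OpenAI

client = OpenAI()

# Keep only the last 4 thread messages in context for this run
run = client.beta.threads.runs.create(
    thread_id="thread_abc123",
    assistant_id="asst_abc123",
    truncation_strategy={"type": "last_messages", "last_messages": 4},
)
```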
Hi!
The File Search Assistant indeed seems very expensive.
An alternative would be to create a Breeb from your files (basically a Breeb is a super-RAG that can then be accessed from any assistant).
It’s 100% free.
Then you have several options for your assistant.
For example, you can create a dedicated GPT connected to the Breeb.
Or you can develop your own chatbot using LangChain, with the Breebs connector.
I need help clarifying some points about how file search works, which will be helpful in reducing costs. I’ve read almost all the documentation.
In fact, my assistant will handle user queries and responses in Arabic, which costs more in tokens than English: an Arabic statement tokenizes to roughly three times as many tokens as an English statement of the same length and meaning.
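You can sanity-check that ratio for your own text with tiktoken (a rough sketch; the sample sentences are illustrative and the exact ratio depends on the tokenizer and the text):

```python
import tiktoken

# gpt-4o uses the o200k_base encoding; fall back to
# tiktoken.get_encoding("o200k_base") if your tiktoken version
# doesn't know the model name
enc = tiktoken.encoding_for_model("gpt-4o")

english = "The company was founded in 1999 and is headquartered in Riyadh."
arabic = "تأسست الشركة عام 1999 ويقع مقرها الرئيسي في الرياض."

print(len(enc.encode(english)), "tokens (English)")
print(len(enc.encode(arabic)), "tokens (Arabic)")
```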
Are the input tokens in the first user query composed of instructions, retrieved chunks, and the user query?
For subsequent user queries, are the input tokens composed of instructions (only once), retrieved chunks (excluding chunks from previous queries), previous user queries and their assistant responses, and the new user query?
Does the process of retrieving chunks by similarity itself incur any cost? I think not; only storing the vector store is billed.
If multiple assistants work in the same thread, are all the assistants’ instructions included in the input tokens, or only the chosen assistant’s instructions (I hope it’s the latter)?
Is it reliable to instruct the assistant to include JSON at the end of its response? The response would contain both the user’s answer and some additional JSON data.
When using Assistants, you provide instructions and additional_instructions, which are placed directly into the context window. You can also send specifications for your own callable functions.
OpenAI adds your function specifications in its own special language. They also add their own tool instructions, which are large instruction blocks for how to use file search, how to use the code interpreter, and how to emit parallel tool calls.
When the user's input message is run, the thread's past chat history is also added. The AI is placed into a loop that only exits with a direct response to the user. Its agent-like behavior kicks in: based on the user input, it can call a function and receive the response, call a tool and receive the response, with each AI output and result added to the thread as internal steps.
So you can see that a file search the AI invokes can immediately add 15,000 tokens of document chunks to the thread. The agent backend then sends the appended messages and context again to see what the AI wants to do next. The AI can retry when it writes code that returns errors or runs searches that aren't satisfactory, terminating only when (hopefully) the AI decides to answer.
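You can watch this loop for any finished run by listing its steps; each file search or function call appears as its own step with its own billed usage (a sketch, IDs are placeholders):

```python
from openai import OpenAI

client = OpenAI()

steps = client.beta.threads.runs.steps.list(
    thread_id="thread_abc123", run_id="run_abc123"
)
for step in steps.data:
    # step.type is "message_creation" or "tool_calls";
    # step.usage shows the prompt/completion tokens billed for that step
    print(step.type, step.usage)
```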
Retries can consume 10,000, 20,000, 30,000 tokens as useless file search results pile up alongside disobeyed instructions.
You pay for the AI model calls made by this semi-autonomous tool, in input and output tokens. The file search tool doesn't care how relevant the search was; it still adds its full chunk count.
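The Run object reports the total billed tokens once it finishes, so you can estimate the cost per run yourself (a sketch; the per-million-token prices below are assumptions, check the current pricing page):

```python
from openai import OpenAI

client = OpenAI()

run = client.beta.threads.runs.retrieve(thread_id="thread_abc123", run_id="run_abc123")

INPUT_PRICE_PER_M = 5.00    # USD per 1M input tokens (assumed gpt-4o pricing)
OUTPUT_PRICE_PER_M = 15.00  # USD per 1M output tokens (assumed gpt-4o pricing)

cost = (run.usage.prompt_tokens * INPUT_PRICE_PER_M
        + run.usage.completion_tokens * OUTPUT_PRICE_PER_M) / 1_000_000
print(f"{run.usage.prompt_tokens} in / {run.usage.completion_tokens} out -> ${cost:.4f}")
```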
If budget or specialization is a concern, then Chat Completions is the place to build your own solution, needing no “tools” or iterative calls.
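A minimal sketch of that route, where you do the retrieval yourself and decide exactly what enters the context (retrieve_chunks is a hypothetical stand-in for your own embedding/similarity search over your documents):

```python
from openai import OpenAI

client = OpenAI()

def retrieve_chunks(question: str, top_k: int = 5) -> list[str]:
    # Placeholder: swap in your own embedding/similarity search
    raise NotImplementedError

def answer(question: str, history: list[dict]) -> str:
    # Your own retrieval, your own token budget
    chunks = retrieve_chunks(question, top_k=5)
    messages = [
        {"role": "system",
         "content": "Answer only from the provided context.\n\n" + "\n\n".join(chunks)},
        *history[-6:],  # keep only the last few turns instead of the whole thread
        {"role": "user", "content": question},
    ]
    response = client.chat.completions.create(model="gpt-4o", messages=messages)
    return response.choices[0].message.content
```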