How knowledge base files are handled (Assistants API)


I have a question about how Assistants work in relation to the files we add as a knowledge base (using the Assistants API, at least; I think it works the same way in the Playground too).

My question is about how GPT treats these files. Based on the costs I am seeing when using the Assistants API, it seems that every time we query the assistant it reads the knowledge base document.

If this is the case, it is not converting the document into embeddings but integrating it somehow into the query prompt. What sequence does it perform? Does it effectively read the document on each query? Why don’t they use embeddings (which would make the process cheaper)?

Thanks in advance.


It depends on how big the file(s) are. There’s not a lot of documentation for the underlying mechanics.

From the documentation: the model decides when to retrieve content based on the user Messages. The Assistants API automatically chooses between two retrieval techniques:

  • it either passes the file content in the prompt for short documents, or
  • performs a vector search for longer documents

Retrieval currently optimizes for quality by adding all relevant content to the context of model calls. We plan to introduce other retrieval strategies to enable developers to choose a different tradeoff between retrieval quality and model usage cost.

So it does perform similarity search when the requirement is met.
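The two documented strategies can be sketched like this. This is a hedged illustration only: `TOKEN_THRESHOLD` is a guess (OpenAI does not publish the cutoff), and `search_fn` stands in for whatever vector search runs behind the scenes.

```python
# Hedged sketch of the two documented retrieval strategies.
# TOKEN_THRESHOLD is an assumption; OpenAI does not publish the real cutoff.
TOKEN_THRESHOLD = 4000

def build_context(file_tokens, query, search_fn):
    """Short file: stuff it all into the prompt. Long file: vector search."""
    if len(file_tokens) <= TOKEN_THRESHOLD:
        return " ".join(file_tokens)      # whole file passed as context
    return search_fn(query)               # similarity search over chunks

short_doc = ["refunds", "accepted", "within", "30", "days"]
context = build_context(short_doc, "refund policy", lambda q: "(top chunks)")
# short file, so the whole content is returned
```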

I would highly recommend exploring your own RAG instead. Assistants Retrieval may one day be ideal, but it’s black-boxed, expensive, and hasn’t received any noticeable updates for 3 months now.


“Assistants Retrieval may one day be ideal but it’s black-boxed, expensive.” Would you please elaborate on how expensive it is? I’m looking into using Assistants and would like to have an idea of how much it would cost. Thanks.

Sure. So the static cost is storage:

Retrieval is priced at $0.20/GB per assistant per day. Attaching a single file ID to multiple assistants will incur the per assistant per day charge when the retrieval tool is enabled. For example, if you attach the same 1 GB file to two different Assistants with the retrieval tool enabled (e.g., customer-facing Assistant #1 and internal employee Assistant #2), you’ll be charged twice for this storage fee (2 * $0.20 per day). This fee does not vary with the number of end users and threads retrieving knowledge from a given assistant.

In addition, files attached to messages are charged on a per-assistant basis if the messages are part of a run where the retrieval tool is enabled. For example, running an assistant with retrieval enabled on a thread with 10 messages each with 1 unique file (10 total unique files) will incur a per-GB per-day charge on all 10 files (in addition to any files attached to the assistant itself).

The second paragraph was silently added in the past couple weeks.

In comparison, using a vector database such as Pinecone costs $0.33/GB/month (OpenAI is $6/GB/month). To be fair, there are also read/write costs, but as seen below they still don’t exceed the storage fee. There is an initial cost as well, but all new serverless vector databases come with $100 in free credit.

For reference, $3.35 for 1M records would indicate roughly 10 GiB of storage.

So, for reference, if you had 5 Assistants with 1 GB of documents attached to each, you would be paying $1/day. With Pinecone it would be about $0.01/day, or $0.33/month (you only need a single database).
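Using the prices quoted in this thread ($0.20/GB per assistant per day for Assistants retrieval; $0.33/GB/month as cited for Pinecone — figures from the thread, not necessarily current rates), the comparison works out like this:

```python
def assistants_storage_per_day(gb_per_assistant, n_assistants, rate=0.20):
    # $0.20 per GB, per assistant, per day (price quoted above)
    return gb_per_assistant * n_assistants * rate

def pinecone_storage_per_day(total_gb, monthly_rate=0.33, days=30):
    # $0.33 per GB per month, spread over ~30 days (price quoted above)
    return total_gb * monthly_rate / days

openai_daily = assistants_storage_per_day(1, 5)   # 5 assistants x 1 GB each
pinecone_daily = pinecone_storage_per_day(1)      # one shared database
```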

If you also wanted to attach files at the thread level and, for whatever reason, attached the same 1 GB file to 5 threads, it would be an additional $1/day.

Realistically it doesn’t make sense to attach 1GB files, but each Assistant is limited to 10 files, so people are encouraged to attach massive files.

This is just for storage.

Then there’s implementation. We don’t know how OpenAI implements the retrieval process, but I think it’s safe to assume it’s token stuffing. You will notice people complaining that they are being charged a lot of “context tokens”. It seems that OpenAI will continuously retrieve chunks until the model is satisfied with the results, and then return them.

Basically, the current implementation favors quality over everything else. With GPT-4 Turbo the context length is ~120k tokens, which can amount to roughly $2 in a single retrieval. There’s no control over this besides using a lesser model.
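As a back-of-the-envelope check, assuming GPT-4 Turbo’s input price at the time ($0.01 per 1K tokens — an assumption about the rate being billed):

```python
input_price_per_1k = 0.01      # USD; historical GPT-4 Turbo input price (assumed)
context_tokens = 120_000       # near the full context window
input_cost = context_tokens / 1000 * input_price_per_1k
# Output tokens (priced higher per 1K) plus repeated retrieval passes
# push a single worst-case exchange toward the ~$2 figure above.
```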

Lastly, this isn’t about cost, but there is a LOT more to RAG than simple semantic comparisons. If you notice the results aren’t satisfactory, your only option with Assistants is to try to improve the document. With a vector DB, by comparison, you have a massive number of tools and metrics to work with.

You can self-host Weaviate (which I recommend, as it comes with a lot of functionality for tinkering), or even just host the vectors yourself in a file. So you can tinker, test, and improve without spending much (or any) money. Heck, I wouldn’t even recommend using ada embeddings anymore.
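A minimal version of “host the vectors yourself in a file” needs only the standard library. The vectors below are toy values; in practice they would come from whatever embedding model you choose.

```python
import json
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def search(records, query_vec, k=3):
    # records: [{"text": ..., "vec": [...]}] as loaded from a JSON file
    ranked = sorted(records, key=lambda r: cosine(r["vec"], query_vec), reverse=True)
    return [r["text"] for r in ranked[:k]]

store = [
    {"text": "refund policy", "vec": [1.0, 0.0]},
    {"text": "shipping times", "vec": [0.0, 1.0]},
]
# Persist and reload with plain JSON: no database required.
blob = json.dumps(store)
results = search(json.loads(blob), [0.9, 0.1], k=1)
```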

Here’s a leaderboard for finding open-source, self-hostable models (you can use Google Colab to host):

And finally, E5, which I have been using and love:


Thank you for a thorough answer. I appreciate that.

Sorry for the question, but what is RAG? (I’m a noob at this.)

Thanks in advance

This is a reply from (ref)

RAG stands for Retrieval Augmented Generation. It’s a technique used in AI models to examine the latest user input and the context of the conversation, and then use embeddings or other search techniques to fill the AI model context with specialized knowledge relevant to the topic. This is usually used for making an AI that can answer about closed-domain problems, such as a company knowledgebase.

The phrase Retrieval Augmented Generation (RAG) comes from a 2020 paper by Lewis et al. from Facebook AI. The idea is to use a pre-trained language model (LM) to generate text, but to use a separate retrieval system to find relevant documents to condition the LM on.

In practice, RAG can be combined with tools like Elasticsearch or Qdrant to enhance the performance of AI models. It can also be fine-tuned for specific use-cases, and can be used in conjunction with Few-Shot Learning to boost the model’s performance and reduce hallucinations.
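The retrieve-then-generate loop described above can be sketched as follows. Here `search`, `embed`, and `llm` are placeholders for a vector search, an embedding model, and a chat model respectively; the prompt wording is illustrative.

```python
def rag_answer(question, search, embed, llm, k=3):
    q_vec = embed(question)                 # 1. embed the latest user input
    chunks = search(q_vec, k)               # 2. retrieve top-k relevant chunks
    prompt = (
        "Answer using only the context below.\n\nContext:\n"
        + "\n---\n".join(chunks)
        + f"\n\nQuestion: {question}"
    )
    return llm(prompt)                      # 3. condition the generation
```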



Hi, I’m new here and new to the Assistants API.
I don’t quite understand how Retrieval works when you break it down. I hope you can verify whether I’m on the right track.

From my understanding: you ask an assistant a question, and if the assistant decides it needs context, it triggers the RAG machinery? The retrieval function then keeps appending retrieved text to the message until the assistant is satisfied, and only then does it generate an answer based on the info the retrieval passed in. And that retrieved context is billed as input tokens? (This is my question.)

For example in my case.

I’m creating an assistant with 2 files attached. When I ask it something that requires the context of those files, the RAG starts passing content in, and that content counts as input tokens? So I’ll be charged at the input rate (e.g., GPT-4 Turbo input at $0.1 USD / 1K tokens)?

It seems the retrieval model tends to pass the whole file instead of just the most relevant parts, which results in a steep price?

In my case I asked only one question and it used 10,633 tokens (10,122 input + output).
Does that mean a single question costs at least $1? That’s an awful lot of money.

(Sorry for my bad English; I’m not a native speaker. I hope that isn’t an issue.)
Thanks for your patience and time :+1:

And what is a “context token”? Is it just an input token? I don’t think so, because if it were, I’d have been charged around $20, but it’s $3.50 this month.

You’re on the right track.

It depends on how big the file is. If it’s small enough (the threshold isn’t documented; possibly tied to the context limit), the whole file is passed as context.

If it’s larger, responsibility is handed over to a vector database, which has the file contents “embedded” into vector representations of their meaning. This allows similarity search to match concepts together.

A question doesn’t necessarily cost $1; it can be more, or less. It can be uncontrollably expensive though, and it’s typical to see people charged more than $1 for a single question. Makes one wonder how much Custom GPTs are costing OpenAI right now.

The truth is: we don’t know anything else. OpenAI hasn’t gone into any technical detail about their retrieval system. We believe the chunks returned from the vector database are then assessed by GPT to determine whether they are satisfactory; if not, the next-ranked chunk is returned, and so on. This can lead to a lot of context tokens.

We call this “token stuffing”, and it’s generally frowned upon. I would liken it to brute-forcing: a weak, resource-intensive method that lacks any strategy.
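Since OpenAI hasn’t documented the mechanism, this is pure speculation, but the “token stuffing” loop described above might look something like this; `is_satisfied` stands in for the presumed extra model judgment call:

```python
def stuff_until_satisfied(ranked_chunks, is_satisfied, max_tokens=120_000):
    # ranked_chunks: [(text, token_count), ...] ordered by similarity
    context, used = [], 0
    for text, n_tokens in ranked_chunks:
        if used + n_tokens > max_tokens:
            break
        context.append(text)
        used += n_tokens                 # every pass bills more context tokens
        if is_satisfied(context):        # presumed extra model call
            break
    return context, used
```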


I highly recommend getting your feet wet and building your own RAG system.

Weaviate is perfect for someone who likes to tinker, and can be self-hosted (it comes in a Docker container for easy transfer to the cloud).

Pinecone is also wonderful. They have excellent documentation and offer a single pod for free. Very clean and intuitive, just not as much tinkering functionality as Weaviate.

You can also keep the vectors yourself in a simple file, but in my opinion that doesn’t make much sense: eventually you’ll want a powerful DB technology and will need to learn the client libraries anyway, so it’s nicer to just jump in.

Lastly: when you think through this process, it becomes obvious that a super AI brain isn’t necessary for every step. For example, you can try to get away with a much smaller model to determine whether the returned context is satisfactory. Providers like AWS are coming out with these technologies.

It’s all very exciting!


Hi there!

I have the same problem with uncontrolled and significant token consumption while using the Assistants Retrieval feature. Sometimes the Assistant consumes 7 to 15k tokens just to reply to one question.

Do you have any ideas how to reduce token usage? Any tips or tricks (besides developing my own RAG system)? Thanks!

How many documents do you have?
How big are they?
What file type are they?

It’s just one 1 MB .txt file with text data and Markdown. The file contains knowledge base articles (no HTML tags or anything else).

Try running the TXT file through GPT and asking it to distill, condense, and convert it to Markdown while retaining the semantics and facts of the document, so it can then be parsed by an embedding model.
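As a sketch of that preprocessing step (the prompt wording is illustrative, and `call_llm` is a placeholder for whatever chat-completion client you use):

```python
# Illustrative distillation prompt; only the prompt is the substance here.
DISTILL_PROMPT = (
    "Distill, condense, and convert the following document to Markdown. "
    "Retain the semantics and facts; the output will be parsed by an "
    "embedding model.\n\n---\n{document}"
)

def distill(document, call_llm):
    # call_llm: any function that sends a prompt to a chat model
    return call_llm(DISTILL_PROMPT.format(document=document))
```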

You can also try this. I’m interested to see how it works:

You are a Sparse Priming Representation (SPR) writer. An SPR is a particular kind of use of language for advanced NLP, NLU, and NLG tasks, particularly useful for the latest generation of Large Language Models (LLMs). You will be given information by the USER which you are to render as an SPR.

LLMs are a kind of deep neural network. They have been demonstrated to embed knowledge, abilities, and concepts, ranging from reasoning to planning, and even to theory of mind. These are called latent abilities and latent content, collectively referred to as latent space. The latent space of an LLM can be activated with the correct series of words as inputs, which will create a useful internal state of the neural network. This is not unlike how the right shorthand cues can prime a human mind to think in a certain way. Like human minds, LLMs are associative, meaning you only need to use the correct associations to "prime" another model to think in the same way.

Render the input as a distilled list of succinct statements, assertions, associations, concepts, analogies, and metaphors. The idea is to capture as much, conceptually, as possible but with as few words as possible. Write it in a way that makes sense to you, as the future audience will be another language model, not a human. Use complete sentences.

Unfortunately the only option you really have is to try and process the document(s) better for retrieval. OpenAI has not documented any of their embedding techniques so it’s a shot in the dark.

RAG is not as scary as it sounds, and I highly recommend moving toward it for now, until the retrieval system is at least out of beta.


Ronald, thank you for the idea! In our case, the articles in the file contain exact instructions and details that can’t be summarized or changed. As for now, building your own RAG system looks like the best solution.


One area where a summarization approach can help is when chunking larger documents. The temptation is often to use large chunk sizes, which can “dilute” the semantic meaning across many words and concepts compared to narrowly defined search terms.

A summary of each large “chunk” would move that chunk into a more defined location in the embedding space, increasing the likelihood of a good match being found. Adding metadata to each summarized chunk that points to the original text would also be useful, I think.
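Sketched in code (with `summarize` and `embed` as placeholders for your own models), the pattern is: embed the summary, but keep metadata pointing back at the full original chunk:

```python
def index_chunks(chunks, summarize, embed):
    index = []
    for i, chunk in enumerate(chunks):
        summary = summarize(chunk)
        index.append({
            "vec": embed(summary),    # similarity search runs over the summary
            "summary": summary,
            "source_id": i,           # metadata pointing back at the original
        })
    return index
```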

Thank you very much! That makes sense. Can you provide any useful articles or research on this topic?

might take a look at SciBERTSUM: Extractive Summarization for Scientific Documents

But it should be noted that simply creating a summary at the page or paragraph level will usually be sufficient.

I personally like a chunking system that targets “about 500 words” (roughly one typical page) but will intelligently go slightly over or under depending on paragraph boundaries. You can then summarize each of these chunks and embed the summary along with a metadata entry pointing to the original text; search via the summary, but provide the full page as context when evaluating it in the LLM.
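A minimal version of that chunker, assuming paragraphs are separated by blank lines (the 500-word target is the figure mentioned above):

```python
def chunk_paragraphs(text, target_words=500):
    # Break only on paragraph boundaries, so chunks can run slightly
    # over or under the word target.
    chunks, current, count = [], [], 0
    for para in text.split("\n\n"):
        words = len(para.split())
        if current and count + words > target_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```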