File Search pricing (retrieving the docs info)

I’m thinking of using File Search in a project that’s currently using Code Interpreter, and I have a question about the pricing. I understand that File Search costs $0.10/GB per day, and that we should also consider the cost of turning the uploaded documents into “chunks” to store them. My question is: is there a cost to then retrieving the information via the assistant?


There’s no separate retrieval cost, but there is a usage cost: if you retrieve 10,000 tokens to add to your context, that’s 10,000 more input tokens you will be charged for.
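As a rough illustration of how the two costs interact (the per-token price below is a placeholder assumption, not a real figure; check the current pricing page for your model):

```python
# Rough cost sketch. STORAGE_PER_GB_PER_DAY matches the $0.10/GB/day figure
# discussed above; the input-token price is a placeholder assumption.
STORAGE_PER_GB_PER_DAY = 0.10
INPUT_PRICE_PER_1M_TOKENS = 5.00  # placeholder -- substitute your model's actual rate

def daily_storage_cost(store_gb: float) -> float:
    """Flat cost of keeping the vector store around for one day."""
    return store_gb * STORAGE_PER_GB_PER_DAY

def retrieval_cost_per_request(retrieved_tokens: int) -> float:
    """Extra input-token cost when retrieved chunks are added to the prompt."""
    return retrieved_tokens / 1_000_000 * INPUT_PRICE_PER_1M_TOKENS

print(daily_storage_cost(2.0))             # 2 GB store -> $0.20 per day
print(retrieval_cost_per_request(10_000))  # 10k tokens -> $0.05 per request at the assumed rate
```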


Does the whole file get added to the context each time? I’m uploading a file, the vector store is created, and then each question has ~17k tokens as input. The price goes up quickly. Is there a way to limit how many tokens are actually sent in the context?
Am I understanding correctly how that works?

It (probably) depends.

The help documentation for the File Search tool reads,

What is the File Search tool?

The file_search tool implements several retrieval best practices out of the box to help you extract the right data from your files to augment the model’s responses. For more information, please read our developer documentation.

By default, the file_search tool uses the following settings:

  • Chunk size: 800 tokens
  • Chunk overlap: 400 tokens
  • Embedding model: text-embedding-3-large at 256 dimensions
  • Maximum number of chunks added to context: 20

What are the restrictions for File upload?
The restrictions for uploading a File are:

  • 512 MB per file
  • 5M tokens per file
  • 10k files per vector store
  • 1 vector store per assistant
  • 1 vector store per thread

The overall storage limit for an org is limited to 100 GB.

This doesn’t give us any information about how the model decides what (or how much) to keep.
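One thing the documented defaults do imply is a ceiling on how much retrieval alone can add to a request: 20 chunks at 800 tokens each is about 16,000 tokens, which lines up with the ~17k input tokens per question reported above. A quick back-of-the-envelope (the chunk-count estimate is an approximation based on the stated defaults, not a guarantee of how chunking is actually implemented):

```python
# Back-of-the-envelope based on the documented file_search defaults.
CHUNK_SIZE = 800          # tokens per chunk
CHUNK_OVERLAP = 400       # tokens shared between consecutive chunks
MAX_CHUNKS_IN_CONTEXT = 20

# Upper bound on tokens that retrieval can add to a single request:
print(CHUNK_SIZE * MAX_CHUNKS_IN_CONTEXT)  # 16000 -- close to the ~17k observed above

def approx_chunk_count(doc_tokens: int) -> int:
    """Approximate chunks (and embeddings) created for one document,
    assuming each chunk advances by CHUNK_SIZE - CHUNK_OVERLAP tokens."""
    step = CHUNK_SIZE - CHUNK_OVERLAP
    return max(1, -(-(doc_tokens - CHUNK_OVERLAP) // step))  # ceiling division

print(approx_chunk_count(100_000))  # ~249 chunks for a 100k-token document
```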

Knowledge files for custom GPTs (probably) aren’t exactly the same as *File Search*, but we might expect them to behave similarly. The help documentation for Knowledge reads,

How does Knowledge work?

You can use the GPT editor to attach up to 20 files to a GPT. Each file can be up to 512 MB in size and can contain 2,000,000 tokens. You can include files containing images, but only the text is currently processed. When you upload a file, the GPT breaks the text up into chunks, creates embeddings (a mathematical way of representing text), and stores them for later use.

When a user interacts with your GPT, the GPT can access the uploaded files to get additional context to augment the user’s query. The GPT chooses one of the following methods based on the requirements of the user’s prompt:

Semantic search - Returns relevant text chunks as described above.

Preferred when responding to “Q&A” style prompts, where a specific portion of the source document is required.

Document review - Entire short documents and/or relevant excerpts of larger documents are returned and included along with the prompt as additional context.

Preferred when responding to summarization or translation prompts, where the entire source document is required.

Now, there are more differences than similarities with respect to the restrictions on files that can be uploaded for File Search and those for Knowledge, but the way they are described feels very similar. It would surprise me if OpenAI built two completely different products that perform essentially identical jobs (that’s a Google thing to do, then unceremoniously kill off whichever product is more beloved… RIP Google Reader :headstone:).

So, if we assume they really only have one RAG solution behind the scenes, it’s reasonable to expect the two to work in essentially the same way, and we can look to the language of the Knowledge help document to glean additional insights.

Doing that, the takeaway would be that how much of the document is added to context depends in part on the user message and on the size of the documents. Notably, they don’t define what counts as a “short” or “larger” document here.

What I can tell you is many people have observed that assistants are incredibly greedy with retrieval tokens. They apparently pull in anything even tangentially related and often fill up the entire model context.

Because of this behavior and the lack of any controls to limit the amount of retrieved context, some people have found file search to be unpredictably expensive to use and either forego using RAG or implement their own so they can control costs.
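For what it’s worth, the do-it-yourself route can be fairly small. Here is a minimal sketch of one way to do the retrieval step yourself (assuming the OpenAI Python SDK and numpy; the chunking and top-k choices are illustrative, not a description of what file_search actually does):

```python
# Minimal DIY retrieval sketch so you can cap exactly how much context you pay for.
# Assumes the openai Python SDK and numpy; chunking and top-k choices are illustrative.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> np.ndarray:
    resp = client.embeddings.create(
        model="text-embedding-3-large",
        input=texts,
        dimensions=256,  # matches the documented file_search default
    )
    return np.array([d.embedding for d in resp.data])

def top_k_chunks(question: str, chunks: list[str], k: int = 3) -> list[str]:
    chunk_vecs = embed(chunks)
    q_vec = embed([question])[0]
    # Cosine similarity; normalize defensively before taking dot products.
    chunk_vecs = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    q_vec = q_vec / np.linalg.norm(q_vec)
    scores = chunk_vecs @ q_vec
    best = np.argsort(scores)[::-1][:k]
    return [chunks[i] for i in best]
```

The point of rolling your own is that only the k chunks you select ever reach the prompt, so the extra input-token cost per request is bounded and predictable.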

If you have very large documents (or several documents) and you don’t need the model to have access to the precise, specific language in those documents, you might consider distilling your documents down to clear, concise facts to retrieve from, rather than including all the superfluous language that surrounds those facts in normal prose.


Thanks a lot for the detailed answer. In my tests the price has been too high even though the information is only present in a specific section of the documents; it seems to be pulling in much more than the specific info (since the rest of the content is similar in topic but different in substance).

I’ll probably explore doing the RAG myself and sending only the small relevant context to the assistant (or running my own, not sure). Hopefully File Search prices become cheaper or more predictable in the short term, though.
