Why is token use so high when using file search?

I’ve created an assistant that uses file search to limit answers to only those that I provide in a source file. When I do this, every question I ask triggers a charge of 18,000 tokens. Why so high? I’m using the gpt-4-turbo model.

You can control this through a combination of adjusting the chunk size and the file_search.max_num_results parameter. See below for more info:

Customizing File Search settings

You can customize how the file_search tool chunks your data and how many chunks it returns to the model context.

Chunking configuration

By default, max_chunk_size_tokens is set to 800 and chunk_overlap_tokens is set to 400, meaning every file is indexed by being split up into 800-token chunks, with 400-token overlap between consecutive chunks.

You can adjust this by setting chunking_strategy when adding files to the vector store. There are certain limitations to chunking_strategy:

  • max_chunk_size_tokens must be between 100 and 4096 inclusive.
  • chunk_overlap_tokens must be non-negative and should not exceed max_chunk_size_tokens / 2.
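
As a sketch (assuming the official openai Python SDK), the chunking strategy is supplied when attaching a file to a vector store. The IDs below are placeholders, and the values halve the defaults to cut per-chunk token cost:

```python
# Static chunking configuration for a vector store file.
# Halving the defaults (800 / 400) reduces the token cost of each retrieved chunk.
chunking_strategy = {
    "type": "static",
    "static": {
        "max_chunk_size_tokens": 400,  # must be between 100 and 4096 inclusive
        "chunk_overlap_tokens": 200,   # should not exceed max_chunk_size_tokens / 2
    },
}

# With the openai SDK this would be passed when adding the file, e.g.:
# client.beta.vector_stores.files.create(
#     vector_store_id="vs_...",   # placeholder vector store ID
#     file_id="file-...",         # placeholder file ID
#     chunking_strategy=chunking_strategy,
# )
```

Smaller chunks mean each retrieved result carries fewer tokens into the context, at the cost of less surrounding context per result.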

Number of chunks

By default, the file_search tool outputs up to 20 chunks for gpt-4* models and up to 5 chunks for gpt-3.5-turbo. You can adjust this by setting file_search.max_num_results in the tool when creating the assistant or the run.
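
For example (again assuming the openai Python SDK, with a placeholder vector store ID), the tool definition with a lowered result count looks like this:

```python
# file_search tool definition capping the number of retrieved chunks.
# Lowering max_num_results from the default (20 for gpt-4* models) limits
# how many chunks enter the model context on each run.
file_search_tool = {
    "type": "file_search",
    "file_search": {"max_num_results": 5},
}

# With the openai SDK this would be passed when creating the assistant, e.g.:
# client.beta.assistants.create(
#     model="gpt-4-turbo",
#     tools=[file_search_tool],
#     tool_resources={"file_search": {"vector_store_ids": ["vs_..."]}},  # placeholder ID
# )
```

The same tool definition can also be passed per run to override the assistant-level setting for a single request.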

Note that the file_search tool may output fewer than this number for a variety of reasons:

  • The total number of chunks is fewer than max_num_results.
  • The total token size of all the retrieved chunks exceeds the token “budget” assigned to the file_search tool. The file_search tool currently has a token budget of:
      • 4,000 tokens for gpt-3.5-turbo
      • 16,000 tokens for gpt-4* models
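
The defaults line up with these budgets: 20 chunks at the default 800 tokens each exactly fills the 16,000-token budget for gpt-4* models, which (plus your question, instructions, and the model's answer) is consistent with usage near 18,000 tokens per query. A quick sanity check:

```python
# Rough per-query retrieval cost: chunks returned x tokens per chunk.
# These are the documented defaults; actual usage adds prompt and response overhead.
default_chunk_tokens = 800
gpt4_max_results = 20

retrieval_tokens = default_chunk_tokens * gpt4_max_results
print(retrieval_tokens)  # 16000, matching the gpt-4* token budget
```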

Source: https://platform.openai.com/docs/assistants/tools/file-search/customizing-file-search-settings

Thank you for that information! I see that by lowering the total number of chunks to 5, I can lower the cost to about 4,000 tokens for a simple query. This still seems high, however, if you start running the numbers on a public website with a pre-sales bot. Any other suggestions to lower costs, or do we just have to wait for gpt-4 to become as inexpensive as gpt-3.5-turbo?