File retrieval in Assistants uses a huge amount of input tokens

I’m testing a setup to do retrieval over 7GB of data about companies. The data is organized in small JSON snippets, stored in files in JSONL format.

I’m uploading about a thousand of these JSONL files and asking the assistant a simple question. It does find something relevant, but omg why is it using almost 6000 tokens in the input?!

I feel like I’m missing something obvious. The cost will be pretty significant: $4 per 1k searches like this.

The data will not be chunked by “logical JSON snippet”. It will be chunked purely by token count across the documents.

With essentially unlimited data to draw from, your input cost per search can be 20 × 800 tokens, or even 20 × 1200 or 1600, depending on how their documented overlap is interpreted. The only reduction would come from whatever document-end “tails” happen to be shorter than 800 tokens, assuming they don’t simply make the last chunk 800 tokens measured back from the end.
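To put a number on the worst case (ignoring the overlap question and any short tail chunks):

  20 chunks × 800 tokens = 16,000 context tokens per search
  1,000 searches ≈ 16M input tokens, billed at the model’s input rate, on top of your prompt and instructions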

You could do some clever preprocessing if you run embeddings on the individual JSON snippets yourself. After obtaining embeddings (perhaps 2B tokens?), they could be ranked in 1D space by distance from example tasks, or with whatever nearly-free iterative computation you can run overnight on a workstation, and then clustered yourself into highly focused sections.
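A rough sketch of that DIY route, assuming the standard OpenAI Python SDK and text-embedding-3-small (file name and helper names here are just illustrative):

  # embed_and_rank.py - embed individual JSON snippets yourself and rank them
  # against a query, instead of letting file_search chunk whole documents.
  import numpy as np
  from openai import OpenAI

  client = OpenAI()

  def embed(texts, model="text-embedding-3-small", batch_size=512):
      # Batch-embed strings; returns a (n, d) array of unit-length vectors.
      vecs = []
      for i in range(0, len(texts), batch_size):
          resp = client.embeddings.create(model=model, input=texts[i:i + batch_size])
          vecs.extend(d.embedding for d in resp.data)
      arr = np.array(vecs)
      return arr / np.linalg.norm(arr, axis=1, keepdims=True)

  def rank_snippets(query, snippets, snippet_vecs, top_k=5):
      # Cosine similarity = dot product of normalized vectors.
      query_vec = embed([query])[0]
      scores = snippet_vecs @ query_vec
      order = np.argsort(-scores)[:top_k]
      return [(snippets[i], float(scores[i])) for i in order]

  # One JSON object per line, as in the original JSONL files.
  with open("companies.jsonl") as f:
      snippets = [line.strip() for line in f if line.strip()]

  # In practice you would embed the corpus once, cache the vectors on disk,
  # and only embed the query at search time.
  snippet_vecs = embed(snippets)
  for text, score in rank_snippets("Which companies make industrial robots?", snippets, snippet_vecs):
      print(f"{score:.3f}  {text[:80]}")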

Of course, if you pay for embeddings once on the individual items, with embeddings that target the actual data, why would you keep paying daily for gigabytes of vector store that amounts to (data × >150% + 1 KB vector) × chunks, and whose retrieval is then confused by chunks that literally mix unrelated snippets together?
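Back-of-the-envelope, assuming roughly 4 characters per token, the default 800/400 chunking, and a 256-dimension float vector (~1 KB) per chunk:

  7 GB of text ≈ 1.75B tokens
  1.75B tokens / 400-token stride ≈ 4.4M chunks (each chunk repeats half of the previous one)
  4.4M chunks × (≈3.2 KB of chunk text + ≈1 KB vector) ≈ 18+ GB stored, more than double the raw data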


Heh, I assumed they’d do something smart with JSON lines, but yeah, probably overlapping chunks and all the usual embedding machinery.

That’s why I was thinking paying for storage actually makes sense, but it seems like the embedding is not that smart, and the retrieval is pretty expensive.

Oh, so they do count it separately in billing, as “context tokens”.

Jeez, I just clicked around manually a dozen times in the Assistants playground and it’s already at a quarter million tokens, for 300-character-long JSON lines.

What’s the use case for this service? As soon as the number of requests increases, it’s just going to cost thousands.


A forum search for “use case”, going back to just 10 days after the release of Assistants, shows that adequate exploration does indeed put that rhetorical question in one’s mind.


Thank you, great overview! I see now: they run whatever they want, and the user has no control.

Then what’s a good architecture currently?

I really don’t want to deal with all this embedding garbage: LangChain, document chunking, managing vector databases.

PS: do you think they’re going to have a retrieval API eventually, so developers can organize the context window as they need?

Yeah, they’re aware of it, and probably going to fix these obvious problems:

By default, the file_search tool uses the following settings:

  • Chunk size: 800 tokens
  • Chunk overlap: 400 tokens
  • Embedding model: text-embedding-3-large at 256 dimensions
  • Maximum number of chunks added to context: 20 (could be fewer)

Known Limitations

We have a few known limitations we’re working on adding support for in the coming months:

  1. Support for modifying chunking, embedding, and other retrieval configurations.
  2. Support for deterministic pre-search filtering using custom metadata.
  3. Support for parsing images within documents (including images of charts, graphs, tables etc.)
  4. Support for retrievals over structured file formats (like csv or jsonl).
  5. Better support for summarization — the tool today is optimized for search queries.

I’ve tested it with another, more traditional use case—analyzing a bunch of PDFs to find savings. It worked pretty well.

I need to research how people solve the problem of dynamically attaching relevant documents to a vector store so that token usage is minimized and the cost of file storage is manageable. This problem is actually two related ones: how to keep track of document metadata, and how to find (with some kind of search) which documents are relevant (scope them) and then attach them to vector stores.
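Something like this is what I have in mind, sketched with the Python SDK (the metadata index, the tag filter, and all IDs are made up for illustration):

  # scope_files.py - filter files by your own metadata, then attach only those
  # to a vector store so file_search only ever sees the relevant subset.
  from openai import OpenAI

  client = OpenAI()

  # Hypothetical metadata index kept in your own DB: file_id -> tags.
  metadata_index = {
      "file-abc123": {"country": "US", "sector": "robotics"},
      "file-def456": {"country": "DE", "sector": "automotive"},
  }

  def pick_relevant_files(**filters):
      # Deterministic pre-search filtering done on your side, not OpenAI's.
      return [
          fid for fid, tags in metadata_index.items()
          if all(tags.get(k) == v for k, v in filters.items())
      ]

  file_ids = pick_relevant_files(sector="robotics")

  # Build a small, scoped vector store for this request / topic.
  vector_store = client.beta.vector_stores.create(
      name="robotics-companies",
      file_ids=file_ids,
  )

  # Point the assistant at it so file_search is limited to that subset.
  client.beta.assistants.update(
      assistant_id="asst_...",
      tool_resources={"file_search": {"vector_store_ids": [vector_store.id]}},
  )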

When OpenAI solves this lookup-and-scoping problem, via metadata management perhaps, it will be a blockbuster RAG!

Have you found a solution? Like, how do people use the vector store so that token usage is minimised?

We just launched a parameter that allows you to reduce the chunk size and/or the number of chunks retrieved – this will help reduce your token costs:

https://platform.openai.com/docs/assistants/tools/file-search/customizing-file-search-settings
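For reference, a minimal sketch of both knobs in the Python SDK (the values and IDs are just examples; the docs page above is authoritative):

  from openai import OpenAI

  client = OpenAI()

  # Smaller chunks with less overlap when adding a file to a vector store.
  client.beta.vector_stores.files.create(
      vector_store_id="vs_...",
      file_id="file-...",
      chunking_strategy={
          "type": "static",
          "static": {"max_chunk_size_tokens": 400, "chunk_overlap_tokens": 100},
      },
  )

  # Fewer chunks pulled into context per file_search call.
  client.beta.assistants.update(
      assistant_id="asst_...",
      tools=[{"type": "file_search", "file_search": {"max_num_results": 5}}],
  )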


Nice! But I got confused. This new chunking_strategy parameter is used with client.beta.vector_stores.files.create() and requires a file_id. So before that, we have to upload the file using client.files.create(), right? My confusion is that I believed the chunking happened right when we uploaded the file with files.create(), but since the parameter goes on vector_stores.files.create(), that suggests the first step doesn’t chunk it at all… is that right?

client.files.create() simply uploads the file. It’s the vector_stores.files.create() call that parses, chunks, embeds, etc.
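A minimal sketch of the two steps in the Python SDK (file name and IDs are placeholder values):

  from openai import OpenAI

  client = OpenAI()

  # Step 1: upload the raw bytes only - no parsing or chunking happens here.
  uploaded = client.files.create(
      file=open("companies-0001.jsonl", "rb"),
      purpose="assistants",
  )

  # Step 2: adding the file to a vector store is what parses, chunks and
  # embeds it, so this is where chunking_strategy belongs.
  client.beta.vector_stores.files.create(
      vector_store_id="vs_...",
      file_id=uploaded.id,
      chunking_strategy={
          "type": "static",
          "static": {"max_chunk_size_tokens": 400, "chunk_overlap_tokens": 100},
      },
  )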
