I’m testing a setup to do retrieval over 7 GB of data about companies. The data is organized as small JSON snippets, stored in files in JSONL format.
I’m uploading about a thousand of these JSONL files and asking the assistant a simple question. It does find something relevant, but omg, why is it using almost 6,000 tokens in the input?!
I feel like I’m missing something obvious. The cost will be pretty significant: about $4 per 1k searches like this.
The data will not be chunked by “logical JSON snippet”. It will be chunked by token count within each document.
With plenty of data to draw the top results from, your input cost can be 20 × 800 tokens per search, or even 20 × 1,200 or 20 × 1,600, depending on how their documented chunk overlap is counted. The only thing keeping it below that is the chance of document-end “tails” shorter than 800 tokens, unless they simply make the last chunk 800 tokens measured back from the end of the document.
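Concretely, assuming the documented defaults of 800-token max chunks, 400-token overlap, and up to 20 returned results, those three cases work out to:

```python
# Worst-case retrieval tokens injected per query, assuming the documented
# file_search defaults: max_chunk_size_tokens=800, chunk_overlap_tokens=400,
# and up to 20 result chunks placed into the model's context.

MAX_CHUNK_TOKENS = 800
OVERLAP_TOKENS = 400
MAX_RESULTS = 20

plain = MAX_RESULTS * MAX_CHUNK_TOKENS                               # 16,000 tokens
one_sided_overlap = MAX_RESULTS * (MAX_CHUNK_TOKENS + OVERLAP_TOKENS)        # 24,000
two_sided_overlap = MAX_RESULTS * (MAX_CHUNK_TOKENS + 2 * OVERLAP_TOKENS)    # 32,000

print(plain, one_sided_overlap, two_sided_overlap)
```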
You could do some clever preprocessing if you run the embeddings on the individual JSON snippets yourself. After obtaining embeddings (on perhaps 2B tokens?), they could be ranked in 1D space by distance from your tasks, or by whatever nearly free iterative computation you can run overnight on a workstation to sort them, and then clustered yourself into highly focused sections.
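As a minimal sketch of that do-it-yourself pass (the file name, the text-embedding-3-small model choice, and the task description are only examples, not anything from your setup), something like this would embed the snippets and rank them in 1D by similarity to a task:

```python
import json
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts, model="text-embedding-3-small"):
    """Embed a batch of strings and return an (n, d) array."""
    resp = client.embeddings.create(model=model, input=texts)
    return np.array([item.embedding for item in resp.data])

# Load individual JSON snippets from one jsonl file (path is illustrative).
snippets = [json.loads(line) for line in open("companies_0001.jsonl")]
texts = [json.dumps(s, ensure_ascii=False) for s in snippets]

vectors = embed(texts)
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)  # unit-normalize

# Rank everything in "1D" by cosine similarity to an example task description;
# an overnight clustering pass would replace this single reference point.
task_vec = embed(["quarterly revenue and cost-saving opportunities"])[0]
task_vec /= np.linalg.norm(task_vec)
order = np.argsort(vectors @ task_vec)[::-1]  # most similar first

for idx in order[:10]:
    print(round(float(vectors[idx] @ task_vec), 3), texts[idx][:80])
```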
Of course, if you pay once for embeddings on the individual items, with the embeddings targeting the actual data, why would you continue to pay daily for gigabytes of vector store that work out to roughly (DATA × (>150%) + a ~1K vector) × chunks, and whose chunks are confused by literally mixed messages when unrelated snippets land in the same chunk?
A forum search for “use case” from as early as 10 days after the release of Assistants shows that adequate exploration does indeed already put that rhetorical question in one’s mind.
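Putting hypothetical numbers on that formula, with every constant below an assumption rather than a quoted spec:

```python
# Back-of-the-envelope vector-store size, following the
# (DATA * >150% + per-chunk vector) * chunks idea above.
# All constants here are assumptions for illustration only.

SOURCE_BYTES = 7 * 1024**3          # 7 GB of jsonl source data
BYTES_PER_TOKEN = 4                 # rough average for English/JSON text
CHUNK_TOKENS, OVERLAP_TOKENS = 800, 400
VECTOR_BYTES = 6 * 1024             # assumed size of one stored embedding

total_tokens = SOURCE_BYTES / BYTES_PER_TOKEN
chunks = total_tokens / (CHUNK_TOKENS - OVERLAP_TOKENS)   # 400-token stride
stored_text = chunks * CHUNK_TOKENS * BYTES_PER_TOKEN     # roughly 2x the source
stored_vectors = chunks * VECTOR_BYTES

print(f"chunks: {chunks:,.0f}")
print(f"stored text: {stored_text / 1024**3:.1f} GB, vectors: {stored_vectors / 1024**3:.1f} GB")
```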
I’ve tested it with another, more traditional use case—analyzing a bunch of PDFs to find savings. It worked pretty well.
I need to research how people solve the problem of dynamically attaching relevant documents to a vector store so that token usage is minimized and the cost of file storage stays manageable. This is actually two related problems: how to keep track of document metadata, and how to find (with some kind of search) which documents are relevant, scope them, and then attach them to vector stores.
When OpenAI solves this problem with lookups and scoping - via metadata management, perhaps - it will be a blockbuster RAG!
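For now, the scoping part I have in mind would look roughly like this; only the files.create / vector_stores.create / vector_stores.files.create calls are the real SDK surface, while the tag index and the selection logic are just a sketch of what I’d hand-roll:

```python
from openai import OpenAI

client = OpenAI()

# Hand-rolled metadata index: file_id -> tags assigned at upload time.
# (Illustrative only; this could live in SQLite or a real search index.)
metadata = {}

def upload_with_tags(path, tags):
    f = client.files.create(file=open(path, "rb"), purpose="assistants")
    metadata[f.id] = set(tags)
    return f.id

def scoped_vector_store(query_tags, name="scoped-search"):
    """Create a short-lived vector store containing only files whose tags
    overlap the query's tags, to keep retrieval (and token usage) narrow."""
    relevant = [fid for fid, tags in metadata.items() if tags & set(query_tags)]
    store = client.beta.vector_stores.create(name=name)
    for fid in relevant:
        # Files are processed asynchronously; poll their status before querying.
        client.beta.vector_stores.files.create(vector_store_id=store.id, file_id=fid)
    return store

# Usage sketch:
# upload_with_tags("acme_2023.jsonl", ["acme", "financials"])
# store = scoped_vector_store(["acme"])
# ...attach store.id to the assistant's file_search tool_resources, then delete it.
```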
Nice! But I got confused. This new chunking_strategy parameter is used with client.beta.vector_stores.files.create() and requires a file_id. So, before that, we have to upload a file using client.files.create(), right? My confusion is that I believed the chunking was done right when we uploaded the file with files.create(), but since it’s done with vector_stores.files.create(), that means the first step does not chunk it… is that right?
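So, to check my understanding, the flow would be something like the sketch below; the file name and the 400/200 chunk numbers are just example values, not defaults:

```python
from openai import OpenAI

client = OpenAI()

# Step 1: upload the raw file -- no chunking happens here.
uploaded = client.files.create(
    file=open("companies_0001.jsonl", "rb"),
    purpose="assistants",
)

# Step 2: attach it to a vector store -- chunking happens at this point,
# so chunking_strategy is passed here (example values, not the defaults).
vs = client.beta.vector_stores.create(name="companies")
client.beta.vector_stores.files.create(
    vector_store_id=vs.id,
    file_id=uploaded.id,
    chunking_strategy={
        "type": "static",
        "static": {
            "max_chunk_size_tokens": 400,
            "chunk_overlap_tokens": 200,
        },
    },
)
```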