The model then decides when to retrieve content based on the user Messages. The Assistants API automatically chooses between two retrieval techniques: it either passes the file content in the prompt for short documents, or performs a vector search for longer documents.
Retrieval currently optimizes for quality by adding all relevant content to the context of model calls. We plan to introduce other retrieval strategies to enable developers to choose a different tradeoff between retrieval quality and model usage cost.
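In code, the setup is just attaching files to an assistant with the retrieval tool enabled; roughly like this with the Python SDK (a minimal sketch, with the model, file name, and instructions as placeholders):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Upload a document so the Assistants API can use it for retrieval
faq_file = client.files.create(
    file=open("faq.txt", "rb"),
    purpose="assistants",
)

# Attach the file and enable the retrieval tool; the API decides on its own
# whether to inline the file content or run a vector search over it
assistant = client.beta.assistants.create(
    name="FAQ Assistant",
    instructions="Answer questions using the uploaded FAQ document.",
    model="gpt-4-1106-preview",
    tools=[{"type": "retrieval"}],
    file_ids=[faq_file.id],
)
```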
This is kind of a mixed message. So does it create vector embeddings, or does it add the whole content of the files to the context window? Looking at the token usage in my profile, I feel like it's just adding the file contents to the prompt.
It depends on the size of the document, tbf, and on what "they" have defined as a document large enough that it merits a vector database.
Right now it's very expensive to use when it's not doing a vector-DB retrieval. I've got a chatbot, and just a few messages today used over 65k tokens…
Are you passing the whole document through, or just some portions of it? It sounds very basic, but using some sentiment classification can let you shorten the amount that you send to the model for generation. With smaller inputs, the quality of generation should improve as well.
I have two text files, each with about 3,500 tokens. There will be a lot more than that in the future; right now it's mostly an FAQ. So I've created an Assistant and uploaded the files there. It seems like that's too short for OpenAI to create embeddings, and it just passes the data of these documents to the query.
I'm thinking about chunking the documents myself and hosting them in a vector DB… It's such a black box right now, and it's hard to improve accuracy.
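The indexing side would be simple enough; something like this sketch (the chunk size, file names, and the ada-002 embedding model are just placeholder choices):

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def chunk_text(text: str, max_chars: int = 1500, overlap: int = 200) -> list[str]:
    """Naive fixed-size chunking with some overlap between neighbouring chunks."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + max_chars])
        start += max_chars - overlap
    return chunks

def embed(texts: list[str]) -> np.ndarray:
    """Embed a batch of texts with the OpenAI embeddings endpoint."""
    resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
    return np.array([item.embedding for item in resp.data])

# Build a tiny in-memory "vector DB" out of the FAQ files
docs = [open(path, encoding="utf-8").read() for path in ["faq_1.txt", "faq_2.txt"]]
chunks = [c for doc in docs for c in chunk_text(doc)]
index = embed(chunks)  # shape: (num_chunks, embedding_dim)
```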
As far as I can tell, Assistants only allow uploading documents for GPT-4. If I can't get more control over how the data is accessed by the LLM, I might even switch to 3.5 Turbo 16k and send the whole content in the query myself. The reasoning of 3.5 is enough for my current task, and it's much cheaper too.
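Sending the whole content myself would just be a plain chat completion with the FAQ in the system prompt, roughly like this (file names and the example question are placeholders):

```python
from openai import OpenAI

client = OpenAI()

# Put the whole FAQ into the system prompt; ~7k tokens fits easily into the 16k context
faq_text = "\n\n".join(
    open(path, encoding="utf-8").read() for path in ["faq_1.txt", "faq_2.txt"]
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo-16k",
    messages=[
        {"role": "system",
         "content": "Answer the user's question using only this FAQ:\n\n" + faq_text},
        {"role": "user", "content": "How do I reset my password?"},
    ],
)
print(response.choices[0].message.content)
```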
3,500 tokens is not a lot, tbf, and it's understandable why the model might be passing them directly rather than creating embeddings for them. While passing the document by itself will get the job done, some recent studies on LLMs have shown that with bigger context windows, the model tends to forget information in the middle and has greater recall for information at the beginning or the end (I have experienced this at 6k tokens myself).
Keeping this in mind, I usually prefer just making embeddings; it might be an extra step, but it will improve your accuracy for sure.
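For example, retrieval at query time can be as small as embedding the question and only sending the top few chunks, which keeps the prompt short and avoids the lost-in-the-middle issue. This sketch reuses `embed`, `chunks`, `index`, and `client` from the chunking example above, and k=3 is an arbitrary choice:

```python
def top_k_chunks(question: str, chunks: list[str], index: np.ndarray, k: int = 3) -> list[str]:
    """Return the k chunks whose embeddings are most similar to the question."""
    q = embed([question])[0]
    # Cosine similarity between the question and every chunk embedding
    sims = index @ q / (np.linalg.norm(index, axis=1) * np.linalg.norm(q))
    best = np.argsort(sims)[::-1][:k]
    return [chunks[i] for i in best]

question = "How do I reset my password?"
context = "\n\n".join(top_k_chunks(question, chunks, index))

answer = client.chat.completions.create(
    model="gpt-3.5-turbo-16k",
    messages=[
        {"role": "system", "content": "Answer using only this context:\n\n" + context},
        {"role": "user", "content": question},
    ],
)
print(answer.choices[0].message.content)
```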