I’ve been playing around with the Assistants API to build a reservation system for restaurants, in which each client has its own thread to make the experience more customized, as it remembers previous interactions.
Retrieval is enabled with a single file of ~10k tokens that contains the restaurant menu. I’m having serious trouble getting this to production because the input tokens are way higher than expected. I can account for the input tokens from the thread context window + function calling + assistant prompt, but the assistant uses retrieval for every message in a thread, even though it wouldn’t be necessary 90% of the time, and that is making it impossible for me to progress.
Is there a way to limit when the assistant uses retrieval? I tried specifying it in the prompt but it’s not working. Would the Embeddings API work better for this use case?
Sorry if this makes no sense, I’m very much a beginner dev.
thread context window + function calling + assistant prompt
is very close, with one minor addition:
thread context window + function calling + assistant prompt + the entire contents of the file being retrieved
when using RAG, at least.
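To make the arithmetic concrete, here is a rough sketch of how the per-message input grows when the whole file rides along. Only the ~10k file size comes from your post; the other token counts are made-up placeholders, not measurements.

```python
def estimate_input_tokens(thread_tokens, function_tokens, prompt_tokens,
                          file_tokens=0):
    """Rough per-run input estimate: the sum of everything sent to the model."""
    return thread_tokens + function_tokens + prompt_tokens + file_tokens


# Placeholder sizes (assumptions); only the ~10k menu file is from the post.
THREAD, FUNCTIONS, PROMPT, MENU_FILE = 1500, 300, 200, 10_000

without_file = estimate_input_tokens(THREAD, FUNCTIONS, PROMPT)
with_file = estimate_input_tokens(THREAD, FUNCTIONS, PROMPT, MENU_FILE)

print(without_file)  # 2000
print(with_file)     # 12000
```

With numbers like these, the file dominates the bill on every message where it gets pulled in, whatever the actual question was.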
Perhaps my question here is: why do you need retrieval? Retrieval is built exactly for those who need it at least 90% of the time, and need it in relation to other distinct bits of information. Therefore, I’m not sure retrieval is the right tool for the job you’re trying to do.
as it remembers previous interactions.
Does it? Keep in mind, you can match a client to a thread if you wish, but there is a cutoff in the context window, and it doesn’t “remember” in a sense, but rather “looks at the fat wad of data in front of its face and makes a reasonable conclusion about how to respond”. Any other memory or persistence would basically be your own DB.
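If you do want persistence beyond the thread, a minimal sketch of "your own DB" could look like the following. It uses SQLite from the Python standard library. The table layout, the function names and the thread ID string are all hypothetical, just one way to map a client to a thread and keep a few notes about them.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a small store for client-to-thread mapping and notes."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS clients ("
        "client_id TEXT PRIMARY KEY, thread_id TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS notes ("
        "client_id TEXT NOT NULL, note TEXT NOT NULL)"
    )
    return conn


def set_thread(conn, client_id, thread_id):
    """Remember which thread belongs to a client (overwrites any old value)."""
    conn.execute(
        "INSERT OR REPLACE INTO clients VALUES (?, ?)", (client_id, thread_id)
    )
    conn.commit()


def get_thread(conn, client_id):
    """Return the stored thread ID, or None for an unknown client."""
    row = conn.execute(
        "SELECT thread_id FROM clients WHERE client_id = ?", (client_id,)
    ).fetchone()
    return row[0] if row else None


def add_note(conn, client_id, note):
    """Store a durable fact about a client, e.g. an allergy or preference."""
    conn.execute("INSERT INTO notes VALUES (?, ?)", (client_id, note))
    conn.commit()


def get_notes(conn, client_id):
    """Return all notes for a client in insertion order."""
    rows = conn.execute(
        "SELECT note FROM notes WHERE client_id = ? ORDER BY rowid", (client_id,)
    ).fetchall()
    return [r[0] for r in rows]
```

You would then put the few notes that matter into the message or instructions yourself, rather than hoping they are still inside the context window.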
I think I need retrieval in order to retrieve the information related to the restaurant menu, with prices and allergens, although it would be used only ~10% of the time. Otherwise the assistant wouldn’t be able to answer questions that require that information.
But rethinking this based on your reply, a new question arises. Would it be possible to solve this by adding a function call to the assistant that is based on embeddings? So when the assistant is asked about anything related to the menu, it automatically calls the “embedding_function”? This way retrieval could be removed from the assistant’s capabilities and the input_tokens would be drastically fewer. Am I making wrong assumptions, or does this even make sense?
Thank you again for taking time to read and reply!
Yeah, I would try making a custom function and see how that helps!
Now, the input tokens will still be higher whenever it does retrieve menu data, but building a custom tool may reduce the number of times that happens. Give it a try!
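As a starting point, here is a minimal sketch of what such a function could look like. Big caveats: the menu items are invented, `search_menu` is a hypothetical name, and `embed` is a toy word-count stand-in so the example runs offline. In a real setup you would swap it for vectors from an embeddings model. The tool definition follows the function-tool JSON shape as I understand it, so check it against the current docs.

```python
import json
import math
import re
from collections import Counter

# Hypothetical menu data; in practice this would come from your menu file.
MENU = [
    {"name": "Margherita pizza", "price": 9.5, "allergens": ["gluten", "milk"]},
    {"name": "Pad thai with peanuts", "price": 11.0, "allergens": ["peanuts", "soy"]},
    {"name": "Grilled salmon", "price": 16.0, "allergens": ["fish"]},
]


def embed(text):
    """Toy stand-in for an embedding: lowercase word counts (an assumption)."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def item_text(item):
    """Flatten a menu item into searchable text."""
    return f"{item['name']} allergens {' '.join(item['allergens'])}"


def search_menu(query, top_k=1):
    """Return the top_k most similar menu items as a JSON string."""
    q = embed(query)
    ranked = sorted(
        MENU, key=lambda item: cosine(q, embed(item_text(item))), reverse=True
    )
    return json.dumps(ranked[:top_k])


# Tool definition to register with the assistant (shape is my understanding).
SEARCH_MENU_TOOL = {
    "type": "function",
    "function": {
        "name": "search_menu",
        "description": "Look up menu items, prices and allergens. "
        "Call only when the user asks about the menu.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {"type": "string", "description": "The menu question."}
            },
            "required": ["query"],
        },
    },
}
```

The point of the design is that only the handful of matching items goes back to the model, and only on the turns where it decides to call the function, instead of the whole ~10k-token file. How reliably it limits itself to menu questions will depend a lot on the description text.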