I am currently experiencing significant delays, often exceeding 40 seconds, when using assistants with retrieval augmentation in the Playground. The setup searches a 1.5 MB text file (a user manual) stored alongside the assistant.
Even with queries that exactly match sentences in the document, the response time is more than 45 seconds. (Since the retrieval step performs a vector search, I presume most of the time is spent there.)
Are there any recommendations or methods to optimize this process? The current response times are impractical for our intended production use.
Thank you for your assistance.
The simplest and easiest solution would be to do the RAG manually, using a vector database and cosine similarity, and then pass the retrieved results to the model for whatever the next step is.
The documentation for RAG with the newer model says it uses a vector search when it decides the data is large enough, which I do think is the case at 1.5 MB. Most likely, it is instead feeding the whole text into the prompt and generating from that.
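A minimal sketch of that manual approach, assuming a toy deterministic bag-of-words embedder in place of a real embeddings API call (the chunks, the `tokenize`/`embed` helpers, and the vocabulary are all illustrative, not from any library):

```python
import re
import numpy as np

def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())

def embed(text, vocab):
    # Toy deterministic bag-of-words vector; a real pipeline would call
    # an embeddings API here instead and cache the results.
    vec = np.zeros(len(vocab))
    for word in tokenize(text):
        if word in vocab:
            vec[vocab[word]] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# Pre-chunked manual text; in practice you would split the 1.5 MB file
# into overlapping chunks and embed each chunk once, offline.
chunks = [
    "To reset the device, hold the power button for ten seconds.",
    "The warranty covers manufacturing defects for two years.",
    "Press the power button briefly to wake the device from sleep.",
]
vocab = {w: i for i, w in enumerate(sorted({w for c in chunks for w in tokenize(c)}))}
chunk_vecs = [embed(c, vocab) for c in chunks]

def top_k(query, k=2):
    # Cosine similarity reduces to a dot product on unit-normalized vectors.
    q = embed(query, vocab)
    order = sorted(range(len(chunks)), key=lambda i: float(chunk_vecs[i] @ q), reverse=True)
    return [chunks[i] for i in order[:k]]

best = top_k("How do I reset the device?", k=1)[0]
```

Because the chunk embeddings are computed once up front, the per-query cost is a single embedding call plus a dot product per chunk, which is milliseconds at this document size.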
@udm17 Thank you for your suggestion.
Before OpenAI released the assistant feature, I had explored the manual vector search track. I was expecting better results with the integrated OpenAI solution. However, I am now questioning its practicality. How can assistant retrieval be considered useful if it struggles with efficiently searching a simple 23-page document? This level of inefficiency is concerning for our intended use.
tbh, my suggestion would be to build your own retrieval function and let the OpenAI assistant call it. I think what OpenAI built is for users who want some quick retrieval, have small amounts of data, and simply don’t want to build their own retrieval system.
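One way to wire that up is with function calling: you declare a tool, and when a run enters the `requires_action` state you execute the call locally and submit the output back. A hedged sketch of the local side only (the `search_manual` name, its schema, and the dispatcher are hypothetical; the actual API round-trips are omitted):

```python
import json

# Hypothetical tool definition you would pass in the assistant's `tools`
# list so the model calls your retrieval instead of the built-in one.
SEARCH_TOOL = {
    "type": "function",
    "function": {
        "name": "search_manual",
        "description": "Search the product manual; returns the most relevant passages.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
        },
    },
}

def search_manual(query: str) -> str:
    # Placeholder: in production this would query your own fast vector
    # store and return the top passages as a single string.
    return f"Top passages for: {query}"

def handle_tool_call(tool_call: dict) -> str:
    # Dispatch one tool call from a `requires_action` run to the
    # matching local function, decoding the JSON-encoded arguments.
    name = tool_call["function"]["name"]
    args = json.loads(tool_call["function"]["arguments"])
    if name == "search_manual":
        return search_manual(**args)
    raise ValueError(f"unknown tool: {name}")

result = handle_tool_call(
    {"function": {"name": "search_manual", "arguments": '{"query": "reset device"}'}}
)
```

This keeps the assistant's conversation management while putting the slow part, retrieval, entirely under your control.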
@xifan.wang, your suggestion to build our own retrieval function while still utilizing the OpenAI assistant architecture is indeed a clever approach. It would enable us to use OpenAI’s retrieval for development and then implement our own retrieval system for production. The challenge lies in maintaining a high-performance endpoint for more efficient retrieval. Would you recommend any tools or platforms?
LangChain and LlamaIndex are quite good for building a customizable vector store for retrieval. There are also managed cloud solutions like Pinecone, Weaviate, etc.
@rfroli, have you tried this yet? If so, what kind of latency improvements are you seeing in the assistant’s responses with a custom retrieval function?
No, I haven’t tried this solution yet, as the cost of using the assistant APIs is quite high, and we can’t afford this expense in a production environment. I’ll consider trying it again once it’s out of beta. However, if the costs remain prohibitive, then I will manually perform the RAG tasks by using the search results from my vector search as the context for simple prompts.