I have tried “Retrieval” tool from OpenAI Assistant API which is to slow. It takes (4 - 8 seconds) for a short prompt and response, (7 - 16 seconds) for a long prompt and response.
Assistant details:
Model: gpt-3.5-turbo-1106
No. of files: 1 (.docx)
File size: 23.3 KB
No of pages in file: 10 pages (2993 words)
Is there something fundamental (like reading documents) that make assistants slower? Or is this just due to it being new? Or is there any way to speed it up?
No I did not use anything to enhance the speed. But I was looking to use this assistant for generating quiz based on the pdf that I upload and release it as an api and it did not work properly
Fast way: extract document to plain text yourself. Include as a RAG assistant message after “system” or before user question. See a stream of chat completion response within a second.
Slow way: use another’s service that puts the decoding and access to information behind an embeddings or function. Don’t see anything until you see the response is status:done and then retrieve it.
One other thing to keep in mind is that streaming makes the Chat Completions API feel faster, so streaming being absent from the Assistants API is likely one contributing factor to it feeling slower.
Given your use case you might be better off using the regular chat completion API and passing along your document in each request. Your word document can fit into the context window for the chat completion.
You will have finer control over what is being sent into the context window as well as getting the instant streaming response.
when would i use chat completion vs assistant api? I want my chatbot to answer questions from my knowledge base but every query is taking 6-10k tokens which is too high. How do i optimize for a lower token + speed below 5-7 secs?
It is taking 10 to 15 seconds for a two line response, this is just a normal chat. Not sure what was the reason. Previously a month back it was giving the same response in 3 to 4 seconds. Not sure what feature addition to assistant, increasing overall response time?