Why is the Assistants API slow? Are there any solutions for speed?

I have tried the “Retrieval” tool from the OpenAI Assistants API, and it is too slow. It takes 4–8 seconds for a short prompt and response, and 7–16 seconds for a long prompt and response.

Assistant details:

  • Model: gpt-3.5-turbo-1106
  • No. of files: 1 (.docx)
  • File size: 23.3 KB
  • No. of pages in file: 10 (2,993 words)

Is there something fundamental (like reading documents) that makes assistants slower? Or is this just because it is new? Or is there any way to speed it up? :thinking:

It is because the Assistants API is still under development.

Yes, you are right, it is still in beta. Have you tried anything to improve its speed?

No, I did not do anything to improve the speed. I was looking to use this assistant to generate quizzes based on a PDF that I upload, and to release it as an API, but it did not work properly.

It retrieves information from the file you uploaded. That’s why it’s slow.

Fast way: extract the document to plain text yourself and include it as a RAG-style assistant message after the “system” message, or before the user question. You will see a stream of chat completion tokens within a second.

Slow way: use someone else’s service that puts the document decoding and information access behind embeddings or a function call. You see nothing until the run’s status is “done”, and only then can you retrieve the response.
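The “fast way” might look like the sketch below. This is illustrative, not a definitive implementation: the system prompt wording, the `build_messages`/`ask` helper names, and the model choice are assumptions, and `client` is assumed to be an `openai.OpenAI()` instance from the v1 Python SDK.

```python
def build_messages(document_text: str, question: str) -> list:
    """Assemble a RAG-style prompt: the extracted document text is injected
    as an assistant message between the system prompt and the user question."""
    return [
        {"role": "system", "content": "Answer using only the provided document."},
        {"role": "assistant", "content": "Document contents:\n" + document_text},
        {"role": "user", "content": question},
    ]


def ask(client, document_text: str, question: str) -> str:
    # `client` is assumed to be an openai.OpenAI() instance (openai-python v1).
    stream = client.chat.completions.create(
        model="gpt-3.5-turbo-1106",
        messages=build_messages(document_text, question),
        stream=True,  # first tokens typically start arriving within a second
    )
    # Concatenate the streamed token deltas into the full reply.
    return "".join(chunk.choices[0].delta.content or "" for chunk in stream)
```

A 3,000-word document like the one above comfortably fits this way, so no retrieval step is needed at all.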

One other thing to keep in mind: streaming makes the Chat Completions API feel faster, so the absence of streaming from the Assistants API is likely one factor in it feeling slower.

Given your use case, you might be better off using the regular Chat Completions API and passing your document along in each request. Your Word document can fit into the context window for a chat completion.

You will also have finer control over what is sent into the context window, and you get the instant streaming response.

When would I use Chat Completions vs. the Assistants API? I want my chatbot to answer questions from my knowledge base, but every query is taking 6–10k tokens, which is too high. How do I optimize for lower token usage and a response time below 5–7 seconds?
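One way to cut per-query tokens is to send only the most relevant chunks of the knowledge base, up to a fixed budget. A minimal sketch, assuming the chunks are already sorted by relevance; the `trim_chunks` helper is hypothetical, and the ~4-characters-per-token ratio is only a rough rule of thumb, not an exact tokenizer count:

```python
def trim_chunks(chunks, max_tokens=3000):
    """Keep the highest-ranked chunks until a rough token budget is hit.
    Uses the common ~4 characters per token approximation."""
    kept, used = [], 0
    for chunk in chunks:  # assumed pre-sorted, most relevant first
        est = len(chunk) // 4 + 1  # rough token estimate for this chunk
        if used + est > max_tokens:
            break
        kept.append(chunk)
        used += est
    return kept
```

Capping the context at a few thousand tokens instead of sending 6–10k per query reduces both cost and time-to-first-token.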

Have you used Assistants API v2 with streaming? If that’s still too slow, maybe try Chat Completions + RAG instead of assistants.
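Assistants v2 streaming could be sketched like this, assuming the v1 openai-python SDK, whose `runs.stream(...)` helper exposes a `text_deltas` iterator; the function name, the `assistant_id`, and the prompt are placeholders, and `client` is assumed to be an `openai.OpenAI()` instance:

```python
def stream_assistant_reply(client, assistant_id: str, question: str) -> str:
    """Stream an Assistants API (v2) run so tokens appear as they are
    generated, instead of polling until the run is finished."""
    # Create a thread seeded with the user's question.
    thread = client.beta.threads.create(
        messages=[{"role": "user", "content": question}]
    )
    parts = []
    # The streaming helper yields plain-text deltas as they arrive.
    with client.beta.threads.runs.stream(
        thread_id=thread.id, assistant_id=assistant_id
    ) as stream:
        for delta in stream.text_deltas:
            print(delta, end="", flush=True)
            parts.append(delta)
    return "".join(parts)
```

Streaming does not change total generation time, but it greatly improves perceived latency because the first words show up almost immediately.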

To quantify “slow”: I just used the Assistants API to ask gpt-4o-mini “I need to solve the equation 3x + 11 = 14. Can you help me?”

It took four minutes to respond.

I waited a few minutes and ran it again; the second time it was done in 28 seconds.

I’m not using tools or vector stores or any other messages.

I don’t mind slow for a beta product, but this sort of speed makes it pretty hard to even experiment with the thing.

Is anyone experiencing awfully slow GPT-4o speeds today? My assistants are taking several minutes to complete simple, short text-based requests.

I just used my gpt-4o-mini-powered chatbot and it returned the correct answer in about 2 seconds (on the first try).

So I believe the answer to this is “use Chat Completions” and local functions if you want performance.
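“Local functions” here means Chat Completions tool calling. A minimal one-round sketch, under assumptions: the `run_with_tools` helper and the model default are illustrative, `client` is assumed to be an `openai.OpenAI()` instance (openai-python v1), and `tool_impls` maps tool names to local Python callables:

```python
import json


def run_with_tools(client, messages, tools, tool_impls, model="gpt-4o-mini"):
    """One round of Chat Completions tool calling: if the model asks for a
    local function, run it and send the result back for a final answer."""
    resp = client.chat.completions.create(model=model, messages=messages, tools=tools)
    msg = resp.choices[0].message
    if not msg.tool_calls:  # the model answered directly
        return msg.content
    messages.append(msg)  # echo the assistant's tool-call turn back
    for call in msg.tool_calls:
        args = json.loads(call.function.arguments)
        result = tool_impls[call.function.name](**args)
        messages.append(
            {"role": "tool", "tool_call_id": call.id, "content": json.dumps(result)}
        )
    final = client.chat.completions.create(model=model, messages=messages)
    return final.choices[0].message.content
```

Running the function locally and sending only its result back keeps the whole exchange to two quick requests, with no run polling in between.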

I wrote a little page to check the response time (hourly) to make it easy to check back and see if it’s fast yet.

Right now it’s consistently around 2 seconds for the fastest-possible call.
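A latency probe like that page might be built around a helper such as this (hypothetical sketch; point `fn` at the cheapest possible API call, and leave the hourly scheduling to cron or similar):

```python
import time


def time_call(fn, *args, **kwargs):
    """Run one call and return (result, elapsed_seconds).

    perf_counter is a monotonic high-resolution clock, so it is safe for
    measuring short durations even if the system clock is adjusted."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start
```

Logging these timings over days makes it easy to tell a platform-wide slowdown from a one-off cold start like the four-minute response above.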

I’m seeing the same issues without adding documents to the assistant.

It is taking 10 to 15 seconds for a two-line response, and this is just a normal chat, so I am not sure of the reason. A month ago it was giving the same response in 3 to 4 seconds. I wonder what feature addition to assistants increased the overall response time.