Assistants API too slow for realtime/production?

pf_mentia · November 10, 2023, 7:43pm

Chatting with an assistant through the API can be slow (4 - 8 seconds) for a short prompt and response. Much slower than regular GPT4 responses (1 - 2 seconds).

This bottleneck essentially makes the assistant API impractical for a realtime chatbot/ production use cases.

Is there something fundamental (like reading documents) that make assistants slower? Or is this just due to it being new?

import time 
from openai import OpenAI

def assistant_response(
        input_message: str, 
        assistant_id: str, 
        thread_id: str, 
        player_profile_path: str
    ):

    # 1. Load assistant.
    print("2. Load assistant.")
    s = time.time()
    assistant = client.beta.assistants.retrieve(assistant_id=assistant_id)
    e = time.time()
    print(e - s)

    if len(assistant.file_ids) < 1:
        s = time.time()
        # 2. Createn an open AI file object 
        print("1. Createn an open AI file object ")
        file = client.files.create(
            file=open(player_profile_path, "rb"),
            purpose='assistants'
        )
        e = time.time()
        print(e - s)

    # 3. Load conversation thread based on player ID. 
    print("3. Load conversation thread based on player ID.")
    s = time.time()
    thread = client.beta.threads.retrieve(thread_id=thread_id)
    e = time.time()
    print(e - s)

    # 4. Add new message to thread. 
    print("4. Add new message to thread. ")
    s = time.time()
    message = client.beta.threads.messages.create(
        thread_id=thread.id,
        role="user",
        content=input_message
        # file_ids=assistant.file_ids
    )
    e = time.time()
    print(e - s)

    print("4. Waiting for run to finish. ")
    s = time.time()
    run = openai.beta.threads.runs.create(
    thread_id=thread.id,
    assistant_id=assistant.id
    )

    while run.status !="completed":
        run = openai.beta.threads.runs.retrieve(
            thread_id=thread.id,
            run_id=run.id
        )
        print(run.status)

    messages = openai.beta.threads.messages.list(
    thread_id=thread.id
    )
    e = time.time()
    print(e - s)

    return messages.data[0].content[0].text.value

Output:

2. Load assistant. 
0.18107199668884277 
3. Load conversation thread based on player ID. 
0.1474926471710205 
4. Add new message to thread. 
0.2909998893737793 
4. Waiting for run to finish. 
queued 
queued
in_progress 
in_progress 
in_progress 
in_progress
in_progress 
in_progress 
in_progress 
in_progress 
in_progress 
in_progress 
in_progress 
in_progress
in_progress 
in_progress 
in_progress 
in_progress
in_progress 
in_progress 
in_progress 
in_progress 
in_progress 
in_progress 
in_progress 
in_progress 
in_progress 
in_progress 
in_progress 
in_progress 
in_progress 
in_progress 
in_progress 
in_progress 
in_progress 
in_progress 
in_progress 
in_progress 
in_progress 
7.238811016082764

Total elapsed time: 7.8624725341796875

Foxalabs · November 10, 2023, 8:04pm

Hi and welcome to the Developer Forum!

It’s a combination of things, accounts come in usage Tiers based on how much you’ve spent and how much time since your first spend, so the lower tiers can have a larger latency, also a great deal of extra load is on the system as people test the new features out, this will settle with time, usually takes about a month to calm down, at least it did last time GPT-4 was released, might take a bit longer with all the new stuff this time.

pf_mentia · November 10, 2023, 8:21pm

Okay, thank you!

Just to clarify, its not a fundamental difference in functionality that slows down the assistants, but rather the combination of things you mentioned.

And therefore in theory, as a user, I would eventually see the same or similar response times using assistants as I do when using GPT4 through the API, as things cool down?

EDIT: I just checked our GPT4 response times (no assistant), and they are also up in the 8 second range. I am guessing this has to do with an influx related to more people using GPT like @Foxabilo mentioned.

LePaladin · November 12, 2023, 9:07pm

On my side, I’ve seen the same thing, but the output is taking ages to come.

My code looks like yours, even if it’s node.js

console.time('createAndRun')
const run = await openai.beta.threads.createAndRun({
  assistant_id: assistant.id,
  thread: {
    messages: [{ role: 'user', content: transcription.text }],
  },
})

// wait for the run to be completed via its status
let status = run.status
while (status !== 'completed') {
  const newrun = await openai.beta.threads.runs.retrieve(run.thread_id, run.id)
  status = newrun.status
}
console.timeEnd('createAndRun')

which gives me, for a small answer of two sentences : createAndRun: 17.337s

17 seconds seems to be really large.

I hope it will slow down

weberson.pontes · November 13, 2023, 8:15pm

Things get worse when you upload bigger files! In my case, sometimes it takes more than 30 seconds to answer a simple question. I uploaded a pdf file (around 7K words) and my assistant is supposed to answer questions based on information in the file. It works fine but it is not a production ready solution. it is taking too much time to answer the questions. Also, it seems that every time a question is made, the model is retrained with the content in the document.

jeff17 · November 15, 2023, 4:23pm

I am having the same issue, today especially. I have had it last over 2 minutes at which point the script stops.

_j · November 15, 2023, 4:43pm

There’s a slight bit of diagnosis available - not that it will get the answer you paid for though.

https://platform.openai.com/docs/api-reference/runs/listRunSteps

You can see if the AI was doing things other than just generating an answer.

anon10827405 · November 15, 2023, 4:58pm

I’ve noticed the same.

One of my assistants’ function calling actually calls the base GPT model for some work and it’s insane how different they are.

It can be almost 10 seconds for a small response from Assistants, but less than 3 seconds for a large paragraph using the base model.

I’m guessing they are just incredibly overloaded. Then I’m seeing a lot of people just straight up doing

while (true) {
  retrieveRun(...params)
}

No sleeps. Just fucking spam the hell outta the endpoint. I mean, why the hell do we have to poll their servers no subscriptions? Bruh

You may want to consider added more options. It’s not just “completed”. For the sake of being nice please add a timeout!

const finishReasons = ["requires_action", "cancelling", "cancelled", "failed", "completed", "expired"]
while(!finishReasons.includes(run.status)){
  run = await openai.beta.threads.runs.retrieve(run.thread_id, run.id);
  await new Promise(resolve => setTimeout(resolve, 1000));
}

taha.khan8899 · December 5, 2023, 10:02pm

The assistants API is essentially useless for me because of the response time.

scharleswatson · December 7, 2023, 7:57pm

I agree with this. Right now I’m getting 30 second wait times for response. When the chat completion can range from a few seconds to maybe 30, it makes no sense to switch to assistant for general chatting at this point. Maybe using the assistant api isn’t really supposed to be for chat but for more behind the scene bot-swarm type stuff? I don’t know but right now it is pretty much useless for my use-case.

derrickob · December 7, 2023, 9:29pm

It’s the best way I’ve managed to deploy a use case for the assistant. Most people just initiate a run and wait for the run to complete when the steps to me are the best shots at getting everything. Also, I see the users above being rude to the endpoint Imagine a while loop without rest

tom_t · December 8, 2023, 8:07am

Try to analyze what it is doing (playground would be the easiest path). For example in data analysis scenario: you pass a file with different format than expected - it will try to read the data anyway - sometimes in few attemps - and it takes time (and tokens).

bookstaber · December 12, 2023, 9:06pm

We’re now more than a month in and Assistants I’m working on aren’t running any more quickly. Has anybody seen an improvement?

scharleswatson · December 12, 2023, 9:10pm

not at all. and even for chat completions I am getting 30+ second waits on longer answers.

BrianLovesAI · December 13, 2023, 3:46am

Yes, you are right. The assistant’s tool is very slow. It thinks, does things, and is impressive. However, the fact is, you cannot use it for production yet.

I believe gpt-4-turbo has to be even faster with more capabilities, which will also improve the assistant’s tool speed and give satisfaction to chat completion users as well.

jungleau · January 22, 2024, 1:08am

For me, when there are documents uploaded to the assistant (even small documents), it can take up to 90+ seconds to respond.

However, when there is no document uploaded to the assistant, it seems that the response time is reduced significantly, but still can take up to 20 seconds in some cases. This is kind of “usable” to me, but would really like to see if Open AI can get it improved

10eputzen · January 29, 2024, 2:56pm

The minimium I got was 20-30 seconds. Usually it takes around 40 seconds. Not really usable on production right now.

CinematicDev · January 29, 2024, 8:11pm

It depends. When documents are uploaded with the assistant it can take forever. When it’s just the instruction it works well and i get responses within about 5 seconds

ManhNguyen · February 1, 2024, 9:26am

Do you know when OpenAI is ready for production on Assistant API ? The quality of retrieval and function calling are impressive for our bot application, but the speed is so slow.

tytung2020 · February 5, 2024, 6:09am

yeah it is too slow at the moment. But I like the thread functionality, which automatically manages chat history, and the file.id system. I don’t want to code a lot of function to do chat history management, which may require a database to manage.

Topic		Replies	Views
We proved the API is intentionally slow API	56	18652	May 2, 2023
[Critical] Over 25% Assistant API Request Timeout Randomly API	81	6225	March 18, 2024
Gpt-3.5-turbo-1106 is very slow API chatgpt	46	8075	December 19, 2023
GPT-3.5 Turbo API response is slow API	20	12603	November 11, 2023
Assistants API Performance API api , assistants-api	11	2946	March 21, 2024

Assistants API too slow for realtime/production?

Related topics