Assistants API too slow for realtime/production?

Chatting with an assistant through the API can be slow (4-8 seconds) for a short prompt and response, much slower than regular GPT-4 responses (1-2 seconds).

This bottleneck essentially makes the Assistants API impractical for realtime chatbot or production use cases.

Is there something fundamental (like reading documents) that makes assistants slower? Or is this just due to it being new?

import time

from openai import OpenAI

client = OpenAI()


def assistant_response(
        input_message: str,
        assistant_id: str,
        thread_id: str,
        player_profile_path: str
    ):

    # 1. Load assistant.
    print("1. Load assistant.")
    s = time.time()
    assistant = client.beta.assistants.retrieve(assistant_id=assistant_id)
    e = time.time()
    print(e - s)

    # 2. Create an OpenAI file object if the assistant has none attached.
    if len(assistant.file_ids) < 1:
        print("2. Create an OpenAI file object.")
        s = time.time()
        file = client.files.create(
            file=open(player_profile_path, "rb"),
            purpose="assistants"
        )
        e = time.time()
        print(e - s)

    # 3. Load conversation thread based on player ID.
    print("3. Load conversation thread based on player ID.")
    s = time.time()
    thread = client.beta.threads.retrieve(thread_id=thread_id)
    e = time.time()
    print(e - s)

    # 4. Add new message to thread.
    print("4. Add new message to thread.")
    s = time.time()
    message = client.beta.threads.messages.create(
        thread_id=thread.id,
        role="user",
        content=input_message
        # file_ids=assistant.file_ids
    )
    e = time.time()
    print(e - s)

    # 5. Start the run and poll until it finishes.
    print("5. Waiting for run to finish.")
    s = time.time()
    run = client.beta.threads.runs.create(
        thread_id=thread.id,
        assistant_id=assistant.id
    )

    while run.status != "completed":
        # Polls in a tight loop with no delay between requests.
        run = client.beta.threads.runs.retrieve(
            thread_id=thread.id,
            run_id=run.id
        )
        print(run.status)

    messages = client.beta.threads.messages.list(
        thread_id=thread.id
    )
    e = time.time()
    print(e - s)

    return messages.data[0].content[0].text.value

Output:

1. Load assistant.
0.18107199668884277
3. Load conversation thread based on player ID.
0.1474926471710205
4. Add new message to thread.
0.2909998893737793
5. Waiting for run to finish.
queued (×2)
in_progress (×37)
7.238811016082764

Total elapsed time: 7.8624725341796875

Hi and welcome to the Developer Forum!

It's a combination of things. Accounts come in usage tiers based on how much you've spent and how long it has been since your first spend, so the lower tiers can see higher latency. There is also a great deal of extra load on the system as people test the new features out; this will settle with time. It usually takes about a month to calm down, at least it did the last time GPT-4 was released, though it might take a bit longer with all the new stuff this time.


Okay, thank you!

Just to clarify, it's not a fundamental difference in functionality that slows down the assistants, but rather the combination of things you mentioned.

And therefore, in theory, as a user, I would eventually see the same or similar response times using assistants as I do when using GPT-4 through the API, as things cool down?

EDIT: I just checked our GPT-4 response times (no assistant), and they are also up in the 8-second range. I am guessing this has to do with an influx of people using GPT, like @Foxabilo mentioned.


On my side, I've seen the same thing, but the output takes ages to arrive.

My code looks like yours, even though it's Node.js:

console.time('createAndRun')
const run = await openai.beta.threads.createAndRun({
  assistant_id: assistant.id,
  thread: {
    messages: [{ role: 'user', content: transcription.text }],
  },
})

// wait for the run to complete by polling its status (no delay between polls here)
let status = run.status
while (status !== 'completed') {
  const newrun = await openai.beta.threads.runs.retrieve(run.thread_id, run.id)
  status = newrun.status
}
console.timeEnd('createAndRun')

which gives me, for a small answer of two sentences: createAndRun: 17.337s

17 seconds seems really long.

I hope it will settle down.

Things get worse when you upload bigger files! In my case, it sometimes takes more than 30 seconds to answer a simple question. I uploaded a PDF file (around 7K words) and my assistant is supposed to answer questions based on information in the file. It works fine, but it is not a production-ready solution; it takes too much time to answer the questions. Also, it seems that the content of the document is re-processed every time a question is asked.
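For context, the kind of setup being described looks roughly like this, as a minimal sketch (the file name, model, and instructions are placeholders, not the poster's actual values). Retrieval over the attached file happens per run, which adds latency on top of the model call itself:

from openai import OpenAI

client = OpenAI()

# Hypothetical file: a PDF the assistant should answer questions about.
file = client.files.create(
    file=open("player_profile.pdf", "rb"),
    purpose="assistants"
)

# Retrieval-enabled assistant: each run may need to search the attached
# file before answering (retrieval, not retraining).
assistant = client.beta.assistants.create(
    model="gpt-4-1106-preview",
    instructions="Answer questions using the attached document.",
    tools=[{"type": "retrieval"}],
    file_ids=[file.id]
)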

I am having the same issue, today especially. I have had runs last over 2 minutes, at which point the script stops.

There's a slight bit of diagnosis available, though it won't get you the answer you paid for.

https://platform.openai.com/docs/api-reference/runs/listRunSteps

You can see if the AI was doing things other than just generating an answer.
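For example, a minimal sketch in Python (the thread and run IDs are placeholders) that lists a run's steps and how long each one took:

from openai import OpenAI

client = OpenAI()

steps = client.beta.threads.runs.steps.list(
    thread_id="thread_abc123",  # placeholder
    run_id="run_abc123"         # placeholder
)

for step in steps.data:
    # step_details.type is "message_creation" or "tool_calls", so you can tell
    # whether the time went to tools (e.g. retrieval) or to writing the answer.
    if step.created_at and step.completed_at:
        print(step.step_details.type, step.status, step.completed_at - step.created_at, "s")
    else:
        print(step.step_details.type, step.status)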

I’ve noticed the same.

One of my assistants uses function calling that actually calls the base GPT model for some work, and it's insane how different they are.

It can be almost 10 seconds for a small response from Assistants, but less than 3 seconds for a large paragraph using the base model.
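For comparison, a minimal sketch timing a plain Chat Completions call against the base model (the model name and prompt are just examples):

import time

from openai import OpenAI

client = OpenAI()

s = time.time()
response = client.chat.completions.create(
    model="gpt-4",  # example model name
    messages=[{"role": "user", "content": "Write a short paragraph about polling."}]
)
print(response.choices[0].message.content)
print("chat.completions latency:", time.time() - s)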

I’m guessing they are just incredibly overloaded. Then I’m seeing a lot of people just straight up doing

while (true) {
  retrieveRun(...params)
}

No sleeps. Just fucking spam the hell outta the endpoint. I mean, why the hell do we have to poll their servers :laughing: no subscriptions? Bruh

You may want to consider handling more statuses; it's not just "completed". And for the sake of being nice, please add a timeout!

const finishReasons = ["requires_action", "cancelling", "cancelled", "failed", "completed", "expired"];
const deadline = Date.now() + 60_000;  // e.g. give up after 60 seconds
while (!finishReasons.includes(run.status) && Date.now() < deadline) {
  run = await openai.beta.threads.runs.retrieve(run.thread_id, run.id);
  await new Promise(resolve => setTimeout(resolve, 1000));  // wait between polls
}
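A Python version of the same idea, as a minimal sketch (wait_for_run is a hypothetical helper; the one-second interval and 60-second budget are arbitrary examples):

import time

from openai import OpenAI

client = OpenAI()

TERMINAL_STATUSES = {"requires_action", "cancelling", "cancelled", "failed", "completed", "expired"}

def wait_for_run(thread_id: str, run_id: str, timeout: float = 60.0, interval: float = 1.0):
    # Poll until the run reaches a terminal status or the time budget runs out.
    deadline = time.time() + timeout
    while time.time() < deadline:
        run = client.beta.threads.runs.retrieve(thread_id=thread_id, run_id=run_id)
        if run.status in TERMINAL_STATUSES:
            return run
        time.sleep(interval)  # be nice to the endpoint between polls
    raise TimeoutError(f"run {run_id} still not finished after {timeout}s")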