Chatting with an assistant through the API can be slow (4–8 seconds) for a short prompt and response — much slower than regular GPT-4 responses (1–2 seconds).
This bottleneck essentially makes the Assistants API impractical for a realtime chatbot or other production use cases.
Is there something fundamental (like reading documents) that makes assistants slower? Or is this just due to it being new?
import time
from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable


def assistant_response(
    input_message: str,
    assistant_id: str,
    thread_id: str,
    player_profile_path: str
):
    # 1. Load assistant.
    print("2. Load assistant.")
    s = time.time()
    assistant = client.beta.assistants.retrieve(assistant_id=assistant_id)
    e = time.time()
    print(e - s)

    if len(assistant.file_ids) < 1:
        # 2. Create an OpenAI file object.
        print("1. Create an OpenAI file object")
        s = time.time()
        file = client.files.create(
            file=open(player_profile_path, "rb"),
            purpose="assistants"
        )
        e = time.time()
        print(e - s)

    # 3. Load conversation thread based on player ID.
    print("3. Load conversation thread based on player ID.")
    s = time.time()
    thread = client.beta.threads.retrieve(thread_id=thread_id)
    e = time.time()
    print(e - s)

    # 4. Add new message to thread.
    print("4. Add new message to thread.")
    s = time.time()
    message = client.beta.threads.messages.create(
        thread_id=thread.id,
        role="user",
        content=input_message
        # file_ids=assistant.file_ids
    )
    e = time.time()
    print(e - s)

    # 5. Start the run and poll until it completes.
    print("4. Waiting for run to finish.")
    s = time.time()
    run = client.beta.threads.runs.create(
        thread_id=thread.id,
        assistant_id=assistant.id
    )
    while run.status != "completed":
        run = client.beta.threads.runs.retrieve(
            thread_id=thread.id,
            run_id=run.id
        )
        print(run.status)
    messages = client.beta.threads.messages.list(
        thread_id=thread.id
    )
    e = time.time()
    print(e - s)
    return messages.data[0].content[0].text.value
Output:
2. Load assistant.
0.18107199668884277
3. Load conversation thread based on player ID.
0.1474926471710205
4. Add new message to thread.
0.2909998893737793
4. Waiting for run to finish.
queued
queued
in_progress
(in_progress repeated 37 times while polling)
7.238811016082764
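The repeated `s = time.time()` / `print(e - s)` pattern in the snippet above can be factored into a small helper. A minimal sketch (the `timed` name and the stand-in `time.sleep` call are illustrative, not part of the OpenAI SDK):

```python
import time
from contextlib import contextmanager

@contextmanager
def timed(label: str):
    # Print the label, run the wrapped block, then print the elapsed seconds.
    print(label)
    start = time.time()
    try:
        yield
    finally:
        print(f"{time.time() - start:.3f}s")

# Usage: wrap each API call instead of repeating the timing boilerplate.
with timed("3. Load conversation thread based on player ID."):
    time.sleep(0.01)  # stand-in for client.beta.threads.retrieve(...)
```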
It’s a combination of things. Accounts are placed in usage tiers based on how much you’ve spent and how long it has been since your first spend, so the lower tiers can see higher latency. There is also a great deal of extra load on the system as people test the new features out; this will settle with time. It usually takes about a month to calm down — at least it did last time GPT-4 was released — though it might take a bit longer with all the new stuff this time.
Just to clarify: it’s not a fundamental difference in functionality that slows down the assistants, but rather the combination of things you mentioned.
And therefore, in theory, as a user I would eventually see the same or similar response times using assistants as I do when using GPT-4 through the API, as things cool down?
EDIT: I just checked our GPT-4 response times (no assistant), and they are also up in the 8-second range. I am guessing this has to do with an influx of new users, like @Foxabilo mentioned.
On my side, I’ve seen the same thing: the output takes ages to come.
My code looks like yours, even though it’s Node.js:
console.time('createAndRun')
const run = await openai.beta.threads.createAndRun({
  assistant_id: assistant.id,
  thread: {
    messages: [{ role: 'user', content: transcription.text }],
  },
})
// wait for the run to be completed via its status
let status = run.status
while (status !== 'completed') {
  const newrun = await openai.beta.threads.runs.retrieve(run.thread_id, run.id)
  status = newrun.status
}
console.timeEnd('createAndRun')
which gives me, for a small answer of two sentences: createAndRun: 17.337s
Things get worse when you upload bigger files! In my case, it sometimes takes more than 30 seconds to answer a simple question. I uploaded a PDF file (around 7K words) and my assistant is supposed to answer questions based on information in the file. It works fine, but it is not a production-ready solution; it takes too much time to answer the questions. Also, it seems that every time a question is asked, the model re-processes the content of the document.
I agree with this. Right now I’m getting 30-second wait times for a response. When chat completion can range from a few seconds to maybe 30, it makes no sense to switch to assistants for general chatting at this point. Maybe the Assistants API isn’t really meant for chat but for more behind-the-scenes, bot-swarm-type work? I don’t know, but right now it is pretty much useless for my use case.
Polling the run steps is the best way I’ve managed to deploy an Assistants use case. Most people just initiate a run and wait for it to complete, but to me the run steps are the best shot at getting everything back. Also, I see the users above being rude to the endpoint: imagine a while loop without any rest between requests.
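The polling loops shown earlier in the thread hammer the API with back-to-back retrieve calls. A gentler loop sleeps between checks and backs off; a minimal sketch, where the generic `fetch_status` callable stands in for a real `runs.retrieve(...).status` call (the function and parameter names here are illustrative, not the OpenAI SDK):

```python
import time

def wait_for_completion(fetch_status, timeout=60.0, initial_delay=0.5, max_delay=5.0):
    """Poll fetch_status() until it returns 'completed', sleeping between calls.

    fetch_status stands in for e.g. lambda: client.beta.threads.runs.retrieve(...).status.
    Raises on terminal failure states, and TimeoutError if the deadline passes.
    """
    deadline = time.time() + timeout
    delay = initial_delay
    while time.time() < deadline:
        status = fetch_status()
        if status == "completed":
            return status
        if status in ("failed", "cancelled", "expired"):
            raise RuntimeError(f"run ended with status {status!r}")
        time.sleep(delay)
        delay = min(delay * 2, max_delay)  # back off instead of hammering the endpoint
    raise TimeoutError("run did not complete before the timeout")
```

The exponential backoff keeps the first checks responsive for fast runs while cutting the request rate for long ones.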
Try to analyze what it is doing (the playground would be the easiest path). For example, in a data-analysis scenario: if you pass a file with a different format than expected, it will try to read the data anyway — sometimes in a few attempts — and that takes time (and tokens).
Yes, you are right. The Assistants tool is very slow. It thinks, does things, and is impressive. However, the fact is, you cannot use it for production yet.
I believe gpt-4-turbo will be even faster with more capabilities, which will also improve the Assistants tool’s speed and satisfy chat-completion users as well.
For me, when there are documents uploaded to the assistant (even small documents), it can take up to 90+ seconds to respond.
However, when there is no document uploaded to the assistant, the response time is reduced significantly, but it can still take up to 20 seconds in some cases. This is kind of “usable” to me, but I would really like to see OpenAI improve it.
It depends. When documents are uploaded with the assistant, it can take forever. When it’s just the instructions, it works well and I get responses within about 5 seconds.
Do you know when the Assistants API will be ready for production? The quality of retrieval and function calling is impressive for our bot application, but the speed is so slow.
Yeah, it is too slow at the moment. But I like the thread functionality, which automatically manages chat history, and the file.id system. I don’t want to write a lot of functions to do chat-history management, which might require a database.
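For comparison, the chat-history bookkeeping that an Assistants thread handles for you is not much code on top of plain chat completions. A minimal in-memory sketch (the `ChatHistory` class is illustrative; only the `{"role": ..., "content": ...}` message shape comes from the chat-completions format):

```python
class ChatHistory:
    """Minimal in-memory chat history, the kind an Assistants thread manages for you."""

    def __init__(self, system_prompt: str, max_turns: int = 20):
        self.system = {"role": "system", "content": system_prompt}
        self.turns = []  # alternating user/assistant messages
        self.max_turns = max_turns

    def add(self, role: str, content: str):
        self.turns.append({"role": role, "content": content})
        # Drop the oldest turns once the window is exceeded, to bound token usage.
        if len(self.turns) > self.max_turns:
            self.turns = self.turns[-self.max_turns:]

    def messages(self):
        # The list you would pass as `messages=` to a chat-completions call.
        return [self.system] + self.turns
```

Persistence across restarts would need a database, which is exactly the bookkeeping the thread abstraction saves you from.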