Chatting with an assistant through the API is slow (4–8 seconds) for a short prompt and response, much slower than regular GPT-4 responses (1–2 seconds).
This bottleneck essentially makes the Assistants API impractical for realtime chatbot / production use cases.
Is there something fundamental (like reading documents) that makes assistants slower? Or is this just due to it being new?
import time

from openai import OpenAI

client = OpenAI()


def assistant_response(
    input_message: str,
    assistant_id: str,
    thread_id: str,
    player_profile_path: str
):
    """Send a message to an assistant's thread, wait for the run, and return the reply text."""
    # 1. Load assistant.
    print("1. Load assistant.")
    s = time.time()
    assistant = client.beta.assistants.retrieve(assistant_id=assistant_id)
    e = time.time()
    print(e - s)

    if len(assistant.file_ids) < 1:
        # 2. Create an OpenAI file object.
        # NOTE: the file is created here but never attached to the assistant or message.
        print("2. Create an OpenAI file object.")
        s = time.time()
        file = client.files.create(
            file=open(player_profile_path, "rb"),
            purpose="assistants"
        )
        e = time.time()
        print(e - s)

    # 3. Load conversation thread based on player ID.
    print("3. Load conversation thread based on player ID.")
    s = time.time()
    thread = client.beta.threads.retrieve(thread_id=thread_id)
    e = time.time()
    print(e - s)

    # 4. Add new message to thread.
    print("4. Add new message to thread.")
    s = time.time()
    message = client.beta.threads.messages.create(
        thread_id=thread.id,
        role="user",
        content=input_message
        # file_ids=assistant.file_ids
    )
    e = time.time()
    print(e - s)

    # 5. Create a run and poll its status until it completes.
    print("5. Waiting for run to finish.")
    s = time.time()
    run = client.beta.threads.runs.create(
        thread_id=thread.id,
        assistant_id=assistant.id
    )
    while run.status != "completed":
        run = client.beta.threads.runs.retrieve(
            thread_id=thread.id,
            run_id=run.id
        )
        print(run.status)
    messages = client.beta.threads.messages.list(
        thread_id=thread.id
    )
    e = time.time()
    print(e - s)

    # Messages are returned newest first.
    return messages.data[0].content[0].text.value
Output:
1. Load assistant.
0.18107199668884277
3. Load conversation thread based on player ID.
0.1474926471710205
4. Add new message to thread.
0.2909998893737793
5. Waiting for run to finish.
queued
queued
in_progress
in_progress
... (in_progress printed 37 times in total)
7.238811016082764
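Side note: the wait loop above calls runs.retrieve as fast as it can, so most of those in_progress lines just reflect the poll frequency, not extra server time. A minimal sketch of the same wait with a short sleep between polls (the wait_for_run helper is my own name, not part of the SDK):

import time


def wait_for_run(client, thread_id: str, run_id: str, poll_interval: float = 0.5):
    """Poll a run until it leaves the queued/in_progress states."""
    while True:
        run = client.beta.threads.runs.retrieve(thread_id=thread_id, run_id=run_id)
        if run.status not in ("queued", "in_progress"):
            return run
        # Sleep between polls so the loop isn't spamming the API.
        time.sleep(poll_interval)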
It’s a combination of things. Accounts are placed in usage tiers based on how much you’ve spent and how long it has been since your first spend, so the lower tiers can see higher latency. There is also a great deal of extra load on the system as people test the new features out. This will settle with time; it usually takes about a month to calm down, at least it did last time GPT-4 was released, though it might take a bit longer with all the new stuff this time.
Just to clarify: it’s not a fundamental difference in functionality that slows down the assistants, but rather the combination of things you mentioned.
And therefore, in theory, as a user I would eventually see the same or similar response times using assistants as I do when using GPT-4 through the API, once things cool down?
EDIT: I just checked our GPT-4 response times (no assistant), and they are also up in the 8-second range. I am guessing this has to do with the influx of people using GPT, like @Foxabilo mentioned.
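For reference, a minimal sketch of how such a comparison can be timed, assuming the standard chat completions endpoint (the prompt here is a placeholder):

import time

from openai import OpenAI

client = OpenAI()

s = time.time()
completion = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "Hello!"}],  # placeholder prompt
)
print(time.time() - s)  # currently ~8s for us, vs the usual 1-2s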
On my side, I’ve seen the same thing, but the output takes ages to come back.
My code looks like yours, even though it’s Node.js:
console.time('createAndRun')
const run = await openai.beta.threads.createAndRun({
  assistant_id: assistant.id,
  thread: {
    messages: [{ role: 'user', content: transcription.text }],
  },
})

// Poll the run until it reaches the "completed" status.
let status = run.status
while (status !== 'completed') {
  // Wait briefly between polls to avoid hammering the API.
  await new Promise((resolve) => setTimeout(resolve, 500))
  const newrun = await openai.beta.threads.runs.retrieve(run.thread_id, run.id)
  status = newrun.status
}
console.timeEnd('createAndRun')
which gives me, for a small answer of two sentences: createAndRun: 17.337s
Things get worse when you upload bigger files! In my case, it sometimes takes more than 30 seconds to answer a simple question. I uploaded a PDF file (around 7K words), and my assistant is supposed to answer questions based on information in the file. It works fine, but it is not a production-ready solution; it takes too much time to answer the questions. Also, it seems that the content of the document is re-processed every time a question is asked.
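For context, my setup is roughly the following sketch (the file path, assistant name, and instructions are placeholders for my actual values):

from openai import OpenAI

client = OpenAI()

# Upload the PDF the assistant should answer from.
file = client.files.create(
    file=open("document.pdf", "rb"),  # placeholder path
    purpose="assistants",
)

# Create an assistant with the retrieval tool and attach the file.
assistant = client.beta.assistants.create(
    name="PDF Q&A assistant",  # placeholder name
    instructions="Answer questions using the attached document.",
    model="gpt-4-1106-preview",
    tools=[{"type": "retrieval"}],
    file_ids=[file.id],
)

If I understand the docs correctly, retrieval chunks and searches the file at query time rather than retraining anything, which would explain why each question adds its own latency.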