Assistant API Performance is Very Slow

Currently I’m working on a chatbot using the Assistants API, and the performance is incredibly slow, with or without files and instructions. The simplest question, e.g. just asking “are you active?”, can take up to 6 seconds, while a more complicated question can take up to 30 seconds to receive an answer. I was concerned I was being rate limited, since my account was only Tier 2, so I tried creating a new account and funding it to start with a clean slate, and I got the same results. Is the Assistants API just broken right now?
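
For reference, here’s roughly how I’m measuring those numbers: wall-clocking a full run end to end, including the polling loop the beta requires (a minimal sketch with the openai Python SDK v1.x; the assistant ID is a placeholder):

```python
import time
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

ASSISTANT_ID = "asst_..."  # placeholder -- substitute your own assistant ID

start = time.perf_counter()

# One fresh thread, one user message, one run.
thread = client.beta.threads.create()
client.beta.threads.messages.create(
    thread_id=thread.id, role="user", content="Are you active?"
)
run = client.beta.threads.runs.create(
    thread_id=thread.id, assistant_id=ASSISTANT_ID
)

# The Assistants beta has no streaming, so we poll until the run finishes.
while run.status not in ("completed", "failed", "cancelled", "expired"):
    time.sleep(0.5)
    run = client.beta.threads.runs.retrieve(thread_id=thread.id, run_id=run.id)

print(f"end-to-end: {time.perf_counter() - start:.1f}s (status={run.status})")
```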

Yesterday I upgraded our account from Tier 1 to Tier 3. It made no difference: it is incredibly slow, with huge latency and ongoing errors from the code interpreter. We have demos of our API-based product scheduled for early next week, and it is just embarrassing how slow it is.

That being said, overall speed seems a little better when using model gpt-4 instead of gpt-4-1106-preview.

Similar situation here. I just finished making this thing feature-complete, and the latency is a joke: the responses are good, but at 25–30 seconds I’ll get laughed out of the room if I try to show it off as a complete product. The only hope I have is to show the text as it’s being generated, which might improve perceived response time.
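
As far as I can tell the Assistants beta doesn’t stream yet, but if you can fall back to Chat Completions, token-by-token streaming is straightforward (a minimal sketch with the openai v1.x SDK):

```python
from openai import OpenAI

client = OpenAI()

# Stream tokens as they are generated instead of waiting for the full reply.
# Total time is unchanged, but perceived latency improves dramatically.
stream = client.chat.completions.create(
    model="gpt-4-1106-preview",
    messages=[{"role": "user", "content": "Are you active?"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```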

This might be worth checking before your presentations: you can probably switch models, or schedule the demo for the time of day when the API is fastest.

So instead of using gpt-3.5-turbo, one should use gpt-4-turbo to see significant gains? That’s going to be heavy on the bills :stuck_out_tongue:

GPT-4-Turbo is almost the same price as GPT-3.5-Turbo used to be…

Let’s make a list, and check it twice…

In my chart, cost is per 1 million tokens so you can compare the prices more easily:

| Model | Training (per 1M) | Input (per 1M) | Output (per 1M) | Context length |
| --- | --- | --- | --- | --- |
| GPT-3.5-turbo-1106 | n/a | $1.00 | $2.00 | 16k (4k out) |
| GPT-3.5-turbo | n/a | $1.50 | $2.00 | 4k |
| GPT-3.5-turbo-16k | n/a | $3.00 | $4.00 | 16k |
| GPT-3.5 Turbo fine-tune | $8.00 | $3.00 | $6.00 | 4k |
| GPT-4-1106 (turbo) | n/a | $10.00 | $30.00 | 128k (4k out) |
| GPT-4 | n/a | $30.00 | $60.00 | 8k |
| babbage-002 base | n/a | $0.40 | $0.40 | 16k |
| babbage-002 fine-tune | $0.40 | $1.60 | $1.60 | 16k |
| davinci-002 base | n/a | $2.00 | $2.00 | 16k |
| davinci-002 fine-tune | $6.00 | $12.00 | $12.00 | 16k |

Price comparison:

| Tokens | gpt-3.5-turbo | gpt-4-turbo (1106) |
| --- | --- | --- |
| input | 1x | 6.67x |
| output | 1x | 15x |
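
To put that in per-request terms, here’s a quick back-of-the-envelope sketch (prices per 1M tokens from the table above; the 1,000-token-in / 500-token-out request is just a made-up example):

```python
# Prices in dollars per 1M tokens, taken from the table above.
PRICES = {
    "gpt-3.5-turbo": {"input": 1.50, "output": 2.00},
    "gpt-4-1106-preview": {"input": 10.00, "output": 30.00},
}

def request_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of a single request at the listed per-1M-token prices."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# A hypothetical request: 1,000 tokens in, 500 tokens out.
for model in PRICES:
    print(f"{model}: ${request_cost(model, 1000, 500):.4f}")
# gpt-3.5-turbo: $0.0025
# gpt-4-1106-preview: $0.0250
```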

What’s interesting is that I don’t think it’s all that slow in every case. I’ve noticed it’s significantly worse when the text is large than when it’s small, which is likely due to the lack of streaming responses. I ended up migrating to llama-index and setting up an agent with its own custom RAG instead of using the OpenAI Assistants API.
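
For anyone considering the same move, the basic pattern is compact (a minimal sketch assuming the 0.x llama_index API and a local `data/` folder of documents):

```python
from llama_index import SimpleDirectoryReader, VectorStoreIndex

# Build a vector index over local documents (the "custom RAG" part).
documents = SimpleDirectoryReader("data").load_data()
index = VectorStoreIndex.from_documents(documents)

# streaming=True prints tokens as they arrive, improving perceived latency.
query_engine = index.as_query_engine(streaming=True)
response = query_engine.query("Summarize the onboarding document.")
response.print_response_stream()
```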

I find that as more runs are done on the same thread, building up messages and context, the response times increase. It’s still usable for me, though; I just don’t let threads get too long, and I create new ones for each user session.
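
Roughly like this (a sketch with the openai v1.x SDK; the in-memory session store is just illustrative):

```python
from openai import OpenAI

client = OpenAI()

# session_id -> thread_id; an in-memory dict purely for illustration.
_threads: dict[str, str] = {}

def thread_for_session(session_id: str) -> str:
    """Reuse the session's thread; each session starts fresh, so
    context (and latency) never builds up across sessions."""
    if session_id not in _threads:
        _threads[session_id] = client.beta.threads.create().id
    return _threads[session_id]

def end_session(session_id: str) -> None:
    # Drop the mapping; the next message from this user gets a new thread.
    _threads.pop(session_id, None)
```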

Sadly, this is a definite issue. None of my responses takes less than 17 seconds, and some take over 60. I did an MVP demo today and the response was great, apart from the obvious comment about how slow it was.

I appreciate we are still in beta. Is there a roadmap that could be shared for the timeline from beta to live? Delivery of my product depends entirely on when this goes live. Rightly or foolishly, I’m assuming there will naturally be an improvement in performance (I could be in trouble here).

Not wanting to look for alternatives at the moment. I was using OpenAI for both ASR (Whisper) and TTS, but I subbed in a local ASR, which is quick and working well, and I’m still using the OpenAI endpoint for TTS, which is a little slow right now. I don’t want to go fully local unless absolutely necessary. Typical response times here vary, but for my use case it’s around 7+ seconds, which is not great!
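
For reference, the TTS call in question is basically this (a sketch with the openai v1.x SDK; `tts-1` and `alloy` are just the defaults I’d reach for, not a recommendation):

```python
from openai import OpenAI

client = OpenAI()

# Synthesize a reply to an mp3 file; tts-1 is the lower-latency TTS model.
speech = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input="Hello! How can I help you today?",
)
speech.stream_to_file("reply.mp3")
```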

For my use case I need the overall query response + TTS to be between 5 and 10 seconds max. No idea if that’s achievable with OpenAI. Anyone? (I appreciate there are a number of factors in play.)

By way of an update: I switched to a local TTS solution, which is fast and works great. So the only slow part of the process is now GPT-4, which I need for its great reasoning ability; none of the other models quite does what’s required. So, unfortunately, I’m slightly blocked at the moment.

Does anyone have experience with the Microsoft Azure versions from a speed/response perspective?

Would you mind sharing which local TTS solution you started using? I’m having the same pain with latency from OpenAI TTS, even when streaming the response.