Assistants API requests are taking a very long time to respond

I’ve been experiencing very long response times (30+ s) when calling the Assistants API since this morning, and I’m seeing the same slow responses in the Playground. Is anyone else having this issue?

Please share the model specified, the tools enabled, the size of any data they may be working with, and the complexity of the tasks.

I’ll try to replicate the particular case to develop an understanding.

When there are API issues with models, a single call can take a long time to begin responding. Assistants can make multiple model calls internally, giving more chances of “finding” this. The internal operation of Assistants with tools also makes it difficult to abort and restart a slow-to-respond call, even with streaming.
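
To illustrate the last point: with streaming you can at least observe a slow start and cancel the run yourself. A minimal sketch, assuming the openai Python SDK (v1.x); the IDs and the 10 s cutoff are placeholders:

```python
import time

from openai import OpenAI

client = OpenAI()

ASSISTANT_ID = "asst_..."   # placeholder: your assistant
THREAD_ID = "thread_..."    # placeholder: your thread
FIRST_TOKEN_TIMEOUT = 10.0  # give up if no text arrives in time

start = time.monotonic()
run_id = None
try:
    # stream=True yields server-sent events as the run progresses
    events = client.beta.threads.runs.create(
        thread_id=THREAD_ID, assistant_id=ASSISTANT_ID, stream=True
    )
    for event in events:
        if event.event == "thread.run.created":
            run_id = event.data.id
        elif event.event == "thread.message.delta":
            print(f"first token after {time.monotonic() - start:.2f}s")
            break
        # status events (queued, in_progress, step created, ...)
        # arrive before any text, so this check gets a chance to run
        if time.monotonic() - start > FIRST_TOKEN_TIMEOUT:
            raise TimeoutError
except TimeoutError:
    # abandoning the stream does not stop the run server-side;
    # cancel it explicitly before retrying
    if run_id:
        client.beta.threads.runs.cancel(run_id, thread_id=THREAD_ID)
```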

A preliminary look at the API models themselves currently shows gpt-4o-2024-11-20 as the winner, outpacing -mini in generation speed.
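
For reference, stats like those below can be collected by streaming a fixed-length completion and timing the chunks. A rough sketch, assuming the openai Python SDK and treating each streamed chunk as approximately one token (max_tokens=256 matches the 256-token responses in the tables):

```python
import time

from openai import OpenAI

client = OpenAI()

def timed_stream(model: str, messages: list, max_tokens: int = 256) -> dict:
    """Stream one completion and measure latency and stream rate."""
    t0 = time.monotonic()
    first = None
    tokens = 0
    stream = client.chat.completions.create(
        model=model, messages=messages, max_tokens=max_tokens, stream=True
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first is None:
                first = time.monotonic() - t0  # time to first token
            tokens += 1
    total = time.monotonic() - t0
    return {
        "latency (s)": first,
        "total response (s)": total,
        # tokens after the first, over the time spent streaming them
        "stream rate": (tokens - 1) / (total - first),
        "total rate": tokens / total,
        "response tokens": tokens,
    }
```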

Test Runs

For prompt_tokens = 1082 (uncached):

For 3 trials of gpt-4o-2024-08-06 @ 2024-12-20 11:50AM:

| Stat | Average | Cold | Minimum | Maximum |
| --- | --- | --- | --- | --- |
| stream rate (tokens/s) | 42.200 | 43.2 | 39.1 | 44.3 |
| latency (s) | 0.528 | 0.7789 | 0.3923 | 0.7789 |
| total response (s) | 6.589 | 6.6816 | 6.1685 | 6.9183 |
| total rate (tokens/s) | 38.939 | 38.314 | 37.003 | 41.501 |
| response tokens | 256.000 | 256 | 256 | 256 |

For 3 trials of gpt-4o-2024-05-13 @ 2024-12-20 11:50AM:

| Stat | Average | Cold | Minimum | Maximum |
| --- | --- | --- | --- | --- |
| stream rate (tokens/s) | 38.000 | 32.0 | 32.0 | 43.3 |
| latency (s) | 0.432 | 0.381 | 0.381 | 0.483 |
| total response (s) | 7.250 | 8.3515 | 6.3291 | 8.3515 |
| total rate (tokens/s) | 35.773 | 30.653 | 30.653 | 40.448 |
| response tokens | 256.000 | 256 | 256 | 256 |

For 3 trials of gpt-4o-2024-11-20 @ 2024-12-20 11:50AM:

| Stat | Average | Cold | Minimum | Maximum |
| --- | --- | --- | --- | --- |
| stream rate (tokens/s) | 68.800 | 102.1 | 48.2 | 102.1 |
| latency (s) | 0.587 | 0.467 | 0.467 | 0.759 |
| total response (s) | 4.697 | 2.9656 | 2.9656 | 5.8227 |
| total rate (tokens/s) | 59.519 | 86.323 | 43.966 | 86.323 |
| response tokens | 256.000 | 256 | 256 | 256 |

For 3 trials of gpt-4o-mini @ 2024-12-20 11:50AM:

| Stat | Average | Cold | Minimum | Maximum |
| --- | --- | --- | --- | --- |
| stream rate (tokens/s) | 50.233 | 57.0 | 44.7 | 57.0 |
| latency (s) | 0.468 | 0.4229 | 0.4229 | 0.538 |
| total response (s) | 5.592 | 4.8928 | 4.8928 | 6.1429 |
| total rate (tokens/s) | 46.198 | 52.322 | 41.674 | 52.322 |
| response tokens | 256.000 | 256 | 256 | 256 |
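
The derived columns are consistent with one another; for instance, “total rate” is approximately “response tokens” divided by “total response” (only approximately, since per-trial rates are averaged rather than the averages divided):

```python
# gpt-4o-2024-08-06 averages: 256 tokens in 6.589 s
print(256 / 6.589)  # ~38.85 tokens/s, close to the reported 38.939
```
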
Analysis

This analysis compares the performance of four versions of the GPT-4o model and identifies areas of underperformance, focusing especially on stream rate, latency, and total response time, which are critical for responsiveness.


Key Observations:

  1. Stream Rate:

    • Underperformance:
      • GPT-4o-2024-05-13 exhibits the lowest average stream rate (38.0 tokens/s), significantly underperforming compared to others.
      • Cold start stream rate (32.0 tokens/s) is particularly poor, affecting user experience during initialization.
    • Better Performance:
      • GPT-4o-2024-11-20 leads in average stream rate (68.8 tokens/s), with the highest cold start stream rate (102.1 tokens/s).
      • GPT-4o-mini shows moderate performance (50.2 tokens/s), only about 19% faster than GPT-4o-2024-08-06, well short of a 50% speedup.
  2. Latency:

    • Underperformance:
      • GPT-4o-2024-11-20 exhibits the highest average latency (0.587 s), indicating delayed processing despite its strong streaming rate.
    • Better Performance:
      • GPT-4o-2024-05-13 has the lowest average latency (0.432 s), suggesting better processing efficiency.
  3. Total Response Time:

    • Underperformance:
      • GPT-4o-2024-05-13 has the longest average total response time (7.250 s), primarily due to its slower stream rate and cold start performance.
    • Better Performance:
      • GPT-4o-2024-11-20 has the shortest total response time (4.697 s), indicating a balance between high stream rate and latency management.
  4. Mini-Model Performance:

    • While GPT-4o-mini might be expected to be roughly 50% faster, its stream rate (50.2 tokens/s) and total response time (5.592 s) only moderately outperform the standard models. It does not achieve a 50% boost relative to GPT-4o-2024-08-06, and it actually trails GPT-4o-2024-11-20 (see the quick check below).
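
A quick check of observation 4 against the stream-rate averages above:

```python
mini, v0806, v1120 = 50.233, 42.200, 68.800
print(f"mini vs 2024-08-06: {mini / v0806 - 1:+.0%}")  # about +19%
print(f"mini vs 2024-11-20: {mini / v1120 - 1:+.0%}")  # about -27%
```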

Areas for Improvement:

  1. Cold Start Performance:

    • GPT-4o-2024-05-13 and GPT-4o-mini show room for optimization during initialization, which impacts cold start metrics.
  2. Mini-Model Optimization:

    • GPT-4o-mini does not meet the expectation of being roughly 50% faster in terms of stream rate and response time. Focused optimization is required to leverage its potential.
  3. Latency Consistency:

    • GPT-4o-2024-11-20, despite excelling in total response time, suffers from higher latency (0.587 s). This suggests inefficiencies in initial token generation that should be addressed.

Summary:

While GPT-4o-2024-11-20 generally leads in performance, its latency issues warrant attention. The mini-model, expected to be substantially faster, does not fully meet its performance target, necessitating further tuning. Meanwhile, GPT-4o-2024-05-13’s cold start and stream rate weaknesses indicate an area ripe for improvement.

Hi, first of all thank you so much for the thorough analysis and response, and sorry for the late reply. The issue occurred while I was specifically using the Assistants API with a 4o model; the prompt tokens were around the same as in your test case, and I used the file_search tool with a vector store of 194 KB. One of the main reasons I chose the Assistants API was to ensure conversation continuity by reusing the same thread ID. In your comparative analysis, all the models, despite differences in response time, show no essential difference, whereas in my case the total response time averaged 25 seconds. Is this an issue with the internal operation of Assistants?
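
For reference, my setup is roughly the following (IDs are placeholders; the assistant and vector store already exist):

```python
from openai import OpenAI

client = OpenAI()

ASSISTANT_ID = "asst_..."   # gpt-4o with the file_search tool enabled,
                            # backed by a 194 KB vector store
THREAD_ID = "thread_..."    # reused across requests for continuity

# each user turn is appended to the same thread...
client.beta.threads.messages.create(
    thread_id=THREAD_ID, role="user", content="..."
)
# ...and a run on that thread produces the reply; this is the call
# that averages ~25 s for me
run = client.beta.threads.runs.create_and_poll(
    thread_id=THREAD_ID, assistant_id=ASSISTANT_ID
)
```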