Assistants API requests are taking a very long time to respond

I’ve been experiencing very long response times (30+ s) when calling the Assistants API since this morning, and I’m seeing the same slow responses in the Playground. Is anyone else having this issue?

Please share the model specified, the tools enabled, the size of any data they may be working with, and the complexity of the tasks.

I’ll try to replicate the particular case to develop an understanding.

When there are API issues with models, a single call can take a long time to begin responding. Assistants can make multiple model calls internally, giving more chances of “finding” this. The internal operation of Assistants with tools also makes it difficult to abort and restart a slow-to-respond call, even with streaming.
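
To illustrate the last point: with streaming you can at least observe a slow start and cancel the run yourself. A minimal sketch, assuming the openai Python SDK (v1.x); the IDs and the 10 s cutoff are placeholders:

```python
import time

from openai import OpenAI

client = OpenAI()

ASSISTANT_ID = "asst_..."   # placeholder: your assistant
THREAD_ID = "thread_..."    # placeholder: your thread
FIRST_TOKEN_TIMEOUT = 10.0  # give up if no text arrives in time

start = time.monotonic()
run_id = None
try:
    # stream=True yields server-sent events as the run progresses
    events = client.beta.threads.runs.create(
        thread_id=THREAD_ID, assistant_id=ASSISTANT_ID, stream=True
    )
    for event in events:
        if event.event == "thread.run.created":
            run_id = event.data.id
        elif event.event == "thread.message.delta":
            print(f"first token after {time.monotonic() - start:.2f}s")
            break
        # status events (queued, in_progress, step created, ...)
        # arrive before any text, so this check gets a chance to run
        if time.monotonic() - start > FIRST_TOKEN_TIMEOUT:
            raise TimeoutError
except TimeoutError:
    # abandoning the stream does not stop the run server-side;
    # cancel it explicitly before retrying
    if run_id:
        client.beta.threads.runs.cancel(run_id, thread_id=THREAD_ID)
```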

A preliminary look at the API models themselves currently shows gpt-4o-2024-11-20 as the winner, outpacing -mini in generation speed.
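
For reference, stats like those below can be collected by streaming a fixed-length completion and timing the chunks. A rough sketch, assuming the openai Python SDK and treating each streamed chunk as approximately one token (max_tokens=256 matches the 256-token responses in the tables):

```python
import time

from openai import OpenAI

client = OpenAI()

def timed_stream(model: str, messages: list, max_tokens: int = 256) -> dict:
    """Stream one completion and measure latency and stream rate."""
    t0 = time.monotonic()
    first = None
    tokens = 0
    stream = client.chat.completions.create(
        model=model, messages=messages, max_tokens=max_tokens, stream=True
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first is None:
                first = time.monotonic() - t0  # time to first token
            tokens += 1
    total = time.monotonic() - t0
    return {
        "latency (s)": first,
        "total response (s)": total,
        # tokens after the first, over the time spent streaming them
        "stream rate": (tokens - 1) / (total - first),
        "total rate": tokens / total,
        "response tokens": tokens,
    }
```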

Test Runs

For prompt_tokens = 1082 (uncached):

For 3 trials of gpt-4o-2024-08-06 @ 2024-12-20 11:50AM:

| Stat | Average | Cold | Minimum | Maximum |
| --- | --- | --- | --- | --- |
| stream rate (tokens/s) | 42.200 | 43.2 | 39.1 | 44.3 |
| latency (s) | 0.528 | 0.7789 | 0.3923 | 0.7789 |
| total response (s) | 6.589 | 6.6816 | 6.1685 | 6.9183 |
| total rate (tokens/s) | 38.939 | 38.314 | 37.003 | 41.501 |
| response tokens | 256.000 | 256 | 256 | 256 |

For 3 trials of gpt-4o-2024-05-13 @ 2024-12-20 11:50AM:

| Stat | Average | Cold | Minimum | Maximum |
| --- | --- | --- | --- | --- |
| stream rate (tokens/s) | 38.000 | 32.0 | 32.0 | 43.3 |
| latency (s) | 0.432 | 0.381 | 0.381 | 0.483 |
| total response (s) | 7.250 | 8.3515 | 6.3291 | 8.3515 |
| total rate (tokens/s) | 35.773 | 30.653 | 30.653 | 40.448 |
| response tokens | 256.000 | 256 | 256 | 256 |

For 3 trials of gpt-4o-2024-11-20 @ 2024-12-20 11:50AM:

| Stat | Average | Cold | Minimum | Maximum |
| --- | --- | --- | --- | --- |
| stream rate (tokens/s) | 68.800 | 102.1 | 48.2 | 102.1 |
| latency (s) | 0.587 | 0.467 | 0.467 | 0.759 |
| total response (s) | 4.697 | 2.9656 | 2.9656 | 5.8227 |
| total rate (tokens/s) | 59.519 | 86.323 | 43.966 | 86.323 |
| response tokens | 256.000 | 256 | 256 | 256 |

For 3 trials of gpt-4o-mini @ 2024-12-20 11:50AM:

| Stat | Average | Cold | Minimum | Maximum |
| --- | --- | --- | --- | --- |
| stream rate (tokens/s) | 50.233 | 57.0 | 44.7 | 57.0 |
| latency (s) | 0.468 | 0.4229 | 0.4229 | 0.538 |
| total response (s) | 5.592 | 4.8928 | 4.8928 | 6.1429 |
| total rate (tokens/s) | 46.198 | 52.322 | 41.674 | 52.322 |
| response tokens | 256.000 | 256 | 256 | 256 |
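
The derived columns are consistent with one another; for instance, “total rate” is approximately “response tokens” divided by “total response” (only approximately, since per-trial rates are averaged rather than the averages divided):

```python
# gpt-4o-2024-08-06 averages: 256 tokens in 6.589 s
print(256 / 6.589)  # ~38.85 tokens/s, close to the reported 38.939
```
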
Analysis

This analysis compares the performance of four versions of the GPT-4o model and identifies areas of underperformance, focusing especially on stream rate, latency, and total response time, which are critical for responsiveness.


Key Observations:

  1. Stream Rate:

    • Underperformance:
      • GPT-4o-2024-05-13 exhibits the lowest average stream rate (38.0 tokens/s), significantly underperforming compared to others.
      • Cold start stream rate (32.0 tokens/s) is particularly poor, affecting user experience during initialization.
    • Better Performance:
      • GPT-4o-2024-11-20 leads in average stream rate (68.8 tokens/s), with the highest cold start stream rate (102.1 tokens/s).
      • GPT-4o-mini shows moderate performance (50.2 tokens/s), only about 19% faster than GPT-4o-2024-08-06, well short of a 50% speedup.
  2. Latency:

    • Underperformance:
      • GPT-4o-2024-11-20 exhibits the highest average latency (0.587 s), indicating delayed processing despite its strong streaming rate.
    • Better Performance:
      • GPT-4o-2024-05-13 has the lowest average latency (0.432 s), suggesting better processing efficiency.
  3. Total Response Time:

    • Underperformance:
      • GPT-4o-2024-05-13 has the longest average total response time (7.250 s), primarily due to its slower stream rate and cold start performance.
    • Better Performance:
      • GPT-4o-2024-11-20 has the shortest total response time (4.697 s), indicating a balance between high stream rate and latency management.
  4. Mini-Model Performance:

    • While GPT-4o-mini might be expected to be roughly 50% faster, its stream rate (50.2 tokens/s) and total response time (5.592 s) only moderately outperform the standard models. It does not achieve a 50% boost relative to GPT-4o-2024-08-06, and it actually trails GPT-4o-2024-11-20 (see the quick check below).
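
A quick check of observation 4 against the stream-rate averages above:

```python
mini, v0806, v1120 = 50.233, 42.200, 68.800
print(f"mini vs 2024-08-06: {mini / v0806 - 1:+.0%}")  # about +19%
print(f"mini vs 2024-11-20: {mini / v1120 - 1:+.0%}")  # about -27%
```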

Areas for Improvement:

  1. Cold Start Performance:

    • GPT-4o-2024-05-13 and GPT-4o-mini show room for optimization during initialization, which impacts cold start metrics.
  2. Mini-Model Optimization:

    • GPT-4o-mini does not meet the expectation of being roughly 50% faster in terms of stream rate and response time. Focused optimization is required to leverage its potential.
  3. Latency Consistency:

    • GPT-4o-2024-11-20, despite excelling in total response time, suffers from higher latency (0.587 s). This suggests inefficiencies in initial token generation that should be addressed.

Summary:

While GPT-4o-2024-11-20 generally leads in performance, its latency issues warrant attention. The mini-model, expected to be substantially faster, does not fully meet its performance target, necessitating further tuning. Meanwhile, GPT-4o-2024-05-13’s cold start and stream rate weaknesses indicate an area ripe for improvement.

Hi, first of all thank you so much for the thorough analysis and response, and sorry for the late reply. The issue occurred while I was specifically using the Assistants API with a 4o model; the prompt tokens were around the same as in your test case, and I used the file_search tool with a vector store of 194 KB. One of the main reasons I chose the Assistants API was to ensure conversation continuity by reusing the same thread ID. In your comparative analysis, all the models, despite differences in response time, show no essential difference, whereas in my case the total response time averaged 25 seconds. Is this an issue with the internal operation of Assistants?
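
For reference, my setup is roughly the following (IDs are placeholders; the assistant and vector store already exist):

```python
from openai import OpenAI

client = OpenAI()

ASSISTANT_ID = "asst_..."   # gpt-4o with the file_search tool enabled,
                            # backed by a 194 KB vector store
THREAD_ID = "thread_..."    # reused across requests for continuity

# each user turn is appended to the same thread...
client.beta.threads.messages.create(
    thread_id=THREAD_ID, role="user", content="..."
)
# ...and a run on that thread produces the reply; this is the call
# that averages ~25 s for me
run = client.beta.threads.runs.create_and_poll(
    thread_id=THREAD_ID, assistant_id=ASSISTANT_ID
)
```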