Runs randomly take > 30sec

After some testing, I can confirm that there seems to be some kind of bug on OpenAI’s end.

This is the simplest tool call I can think of to generate using a Run:

import timeit

# get_client(), time_block(), TestConfig, and TestResult are helpers defined elsewhere in my test harness.
def run_with_empty_thread(config: TestConfig) -> TestResult:
    client = get_client()
    start_time = timeit.default_timer()
    with time_block('client.beta.threads.create()'):
        thread = client.beta.threads.create()
    with time_block('client.beta.threads.runs.create_and_poll()'):
        run = client.beta.threads.runs.create_and_poll(
            assistant_id=config.assistant_id,
            thread_id=thread.id,
            model=config.model,
            additional_instructions=config.prompt,
            additional_messages=[{'role': 'user', 'content': config.query}],
            tools=config.tools
        )
    if run.status == 'completed':  # only completes immediately when no tools are required
        with time_block('client.beta.threads.messages.list()'):
            messages = client.beta.threads.messages.list(thread_id=thread.id, limit=1)
        exec_time = timeit.default_timer() - start_time
        return TestResult(
            completion_time=exec_time,
            tokens_total=run.usage.total_tokens,
            tokens_prompt=run.usage.prompt_tokens,
            tokens_completion=run.usage.completion_tokens,
            tools=[],
            response=messages.data[0].content[0].text.value
        )
    elif run.status == 'requires_action':  # when tools are required, we have to submit outputs and poll a second time
        tool_outputs = []  # fake the output, we only care about OpenAI speed
        tools = run.required_action.submit_tool_outputs.tool_calls
        for tool_call in tools:
            tool_outputs.append({'tool_call_id': tool_call.id, 'output': 'success'})  # real args are in tool_call.function.arguments
        with time_block('client.beta.threads.runs.create_and_poll() - with tool_outputs'):
            run = client.beta.threads.runs.submit_tool_outputs_and_poll(
                run_id=run.id,
                thread_id=run.thread_id,
                tool_outputs=tool_outputs
            )
        if run.status == 'completed':
            with time_block('client.beta.threads.messages.list()'):
                messages = client.beta.threads.messages.list(thread_id=thread.id, limit=1)
            exec_time = timeit.default_timer() - start_time
            return TestResult(
                completion_time=exec_time,
                tokens_total=run.usage.total_tokens,
                tokens_prompt=run.usage.prompt_tokens,
                tokens_completion=run.usage.completion_tokens,
                tools=tools,
                response=messages.data[0].content[0].text.value
            )
        else:
            print(f"Unexpected run status: [{run.status}]")
    else:
        print(f"Unexpected run status: [{run.status}]")

On average, this is about a 5-second process:

Execution time of block[client.beta.threads.create()]: 0.200420 seconds
Execution time of block[client.beta.threads.runs.create_and_poll()]: 2.262922 seconds
Execution time of block[client.beta.threads.runs.create_and_poll() - with tool_outputs]: 1.765574 seconds
Execution time of block[client.beta.threads.messages.list()]: 1.017133 seconds

However, once in a while it takes forever. (This screenshot is from before I started benchmarking each individual client call.)

A 30 second wait is obviously a no-go for user experience.

Edit:

I also wanted to ask if it’s expected for a Run operation (not including the tool submission) to take 5x longer than a chat completion on average. The same tool, prompt, and query take 0.5-0.7 sec in my testing as a chat completion. I can’t find anything in the documentation about an expected speed difference.
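
For reference, the chat-completions side of that comparison looks roughly like this (a sketch, not my exact test code; the config fields stand in for the same prompt, query, and tools as the Run test above):

import timeit
from openai import OpenAI

client = OpenAI()

def time_chat_completion(config) -> float:
    # Same prompt, query, and tools as the Run test, sent through chat completions instead.
    start_time = timeit.default_timer()
    completion = client.chat.completions.create(
        model=config.model,
        messages=[
            {'role': 'system', 'content': config.prompt},
            {'role': 'user', 'content': config.query},
        ],
        tools=config.tools,
    )
    exec_time = timeit.default_timer() - start_time
    print(f"chat.completions.create(): {exec_time:.3f} seconds, "
          f"finish_reason={completion.choices[0].finish_reason}")
    return exec_time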

2 Likes

Maybe it isn’t the Run, but in fact the Thread. Capturing more detail shows that on one of the >30sec responses, most of the time was spent creating the Thread. Will continue testing.
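
Something like this should isolate it: hammer thread creation by itself and flag the outliers (a rough sketch; the 5-second threshold is arbitrary):

import timeit
from openai import OpenAI

client = OpenAI()

def benchmark_thread_creation(n: int = 50, slow_threshold: float = 5.0) -> None:
    # Create empty threads in a loop and report any calls that take suspiciously long.
    timings = []
    for i in range(n):
        start = timeit.default_timer()
        client.beta.threads.create()
        elapsed = timeit.default_timer() - start
        timings.append(elapsed)
        if elapsed > slow_threshold:
            print(f"Outlier on iteration {i}: threads.create() took {elapsed:.2f}s")
    print(f"avg={sum(timings) / len(timings):.3f}s  max={max(timings):.3f}s")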

1 Like

I was just whacking the send button on GPT-4o in the Playground, and there have been lots of unexpected delays before getting output when using a response schema. I don’t know what the results of a script punishing that model directly would be.

One of the interesting things you show in your screenshot is the delay in getting the first tool response, but not the later ones.

Now consider where the backend delays would happen if OpenAI was silently using prefix context caching to reduce their costs on likely input repeats, like system prompts with tool specs.

Fine tune steps have been impossibly slow to spin up.

2 Likes

I wasn’t going to bring it up just yet, but as you noticed there does seem to be some kind of caching going on as well. Perhaps we should start another thread for it to make sure it gets the right amount of attention. I’m much more concerned about those big delays. It’s such a huge amount of time that it’s almost like waiting for a whole VM to spin up. Originally I thought perhaps there was some indexing operation going on with an active Thread, but no - it happens when creating brand new Threads.

1 Like

I got an average delay of 10 sec while using the OpenAI completions API with streaming enabled. The model I used was gpt-3.5-turbo-0125.

Here are my logs,

2024-09-11 17:21:36,406 - root - INFO - Agent Time : 6.133776664733887
2024-09-11 17:21:36,550 - root - INFO - Agent Time : 6.277570724487305

2024-09-11 17:21:46,241 - root - INFO - Agent Time : 9.689182043075562
2024-09-11 17:21:46,507 - root - INFO - Agent Time : 9.954988718032837
2024-09-11 17:21:46,509 - root - INFO - Agent Time : 9.957152366638184
2024-09-11 17:21:46,531 - root - INFO - Agent Time : 9.9789137840271

2024-09-11 17:21:55,189 - root - INFO - Agent Time : 8.65730333328247
2024-09-11 17:21:55,365 - root - INFO - Agent Time : 8.833137273788452

2024-09-11 17:22:05,418 - root - INFO - Agent Time : 10.052475690841675
2024-09-11 17:22:05,575 - root - INFO - Agent Time : 10.208851337432861

2024-09-11 17:22:17,587 - root - INFO - Agent Time : 12.011567115783691
2024-09-11 17:22:17,769 - root - INFO - Agent Time : 12.193908452987671

2024-09-11 17:22:29,460 - root - INFO - Agent Time : 11.689696311950684
2024-09-11 17:22:29,666 - root - INFO - Agent Time : 11.895195245742798

2024-09-11 17:22:44,553 - root - INFO - Agent Time : 14.885767698287964
2024-09-11 17:22:44,855 - root - INFO - Agent Time : 15.187988042831421

2024-09-11 17:22:58,446 - root - INFO - Agent Time : 13.589932203292847
2024-09-11 17:22:58,787 - root - INFO - Agent Time : 13.93083930015564

2024-09-11 17:23:14,130 - root - INFO - Agent Time : 15.342631578445435
2024-09-11 17:23:14,315 - root - INFO - Agent Time : 15.527150392532349

2024-09-11 17:23:30,206 - root - INFO - Agent Time : 15.890387058258057
2024-09-11 17:23:30,362 - root - INFO - Agent Time : 16.04568338394165
2024-09-11 17:23:33,066 - root - INFO - Agent Time : 18.7505145072937
2024-09-11 17:23:33,501 - root - INFO - Agent Time : 19.185081005096436
2024-09-11 17:23:33,502 - root - INFO - Agent Time : 19.186529397964478
2024-09-11 17:23:33,503 - root - INFO - Agent Time : 19.1872341632843

@GoldenJoe discovered occasional delays in thread creation, which is outside of AI generation.

Now here’s another thought: on initial thread creation, the AI model isn’t known. Later, the model that last fulfilled the thread can be obtained. What could be done with that information then?

Threads are not interesting to me, though.


I consider “delay” or “latency” to be how long a user has to sit waiting with nothing being produced.

This chat completions benchmark I had produces a latency stat for getting the first stream chunk (although that chunk usually has no content, or has filtered content).

I added “cold” to show the first response, here taken after considerable inactivity on the “efficient” model names tested. The first API call seemed pattern-free on the standard models.
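
The latency stat below is essentially time-to-first-chunk; measuring it looks roughly like this (a sketch, not the actual bench script; chunk count is used as a rough stand-in for token count):

import timeit
from openai import OpenAI

client = OpenAI()

def measure_stream(model: str, prompt: str = 'Write a long story.') -> None:
    # Report time to first streamed chunk ("latency") versus total time and rough rate.
    start = timeit.default_timer()
    first_chunk_time = None
    chunks = 0
    stream = client.chat.completions.create(
        model=model,
        messages=[{'role': 'user', 'content': prompt}],
        max_tokens=256,
        stream=True,
    )
    for chunk in stream:
        if first_chunk_time is None:
            first_chunk_time = timeit.default_timer() - start
        chunks += 1
    total = timeit.default_timer() - start
    print(f"{model}: latency={first_chunk_time:.3f}s total={total:.3f}s "
          f"rate~{chunks / total:.1f} chunks/s")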

For 5 trials of gpt-4o-2024-08-06 @ 2024-09-11 11:43AM:

Stat Average Cold Minimum Maximum
stream rate Avg: 54.780 Cold: 68.7 Min: 23.0 Max: 68.7
latency (s) Avg: 0.451 Cold: 0.37 Min: 0.37 Max: 0.4859
total response (s) Avg: 5.929 Cold: 4.0813 Min: 4.0813 Max: 11.5507
total rate Avg: 49.891 Cold: 62.725 Min: 22.163 Max: 62.725
response tokens Avg: 256.000 Cold: 256 Min: 256 Max: 256

Tokens are counted by the client to ensure reaching max_tokens.
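
Client-side counting just means encoding the collected response text, for example with tiktoken. A sketch of the idea (collected_chunks is a placeholder for the text deltas gathered while streaming):

import tiktoken

# Count completion tokens client-side to confirm the response actually reached max_tokens.
enc = tiktoken.get_encoding('o200k_base')  # encoding used by gpt-4o / gpt-4o-mini
response_text = ''.join(collected_chunks)  # collected_chunks: text deltas gathered while streaming
print(f"response tokens: {len(enc.encode(response_text))}")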

More Models

For 5 trials of gpt-3.5-turbo-0125 @ 2024-09-11 11:43AM:

Stat Average Cold Minimum Maximum
stream rate Avg: 66.780 Cold: 79.1 Min: 51.5 Max: 79.1
latency (s) Avg: 0.446 Cold: 0.6457 Min: 0.2696 Max: 0.6457
total response (s) Avg: 4.351 Cold: 3.8711 Min: 3.8711 Max: 5.4312
total rate Avg: 59.755 Cold: 66.131 Min: 47.135 Max: 66.131
response tokens Avg: 256.000 Cold: 256 Min: 256 Max: 256

For 5 trials of gpt-4o @ 2024-09-11 11:43AM:

Stat Average Cold Minimum Maximum
stream rate Avg: 59.220 Cold: 52.6 Min: 49.2 Max: 81.4
latency (s) Avg: 0.406 Cold: 0.4319 Min: 0.3009 Max: 0.5484
total response (s) Avg: 4.868 Cold: 5.2843 Min: 3.4325 Max: 5.6328
total rate Avg: 54.443 Cold: 48.445 Min: 45.448 Max: 74.581
response tokens Avg: 256.000 Cold: 256 Min: 256 Max: 256

For 5 trials of gpt-4o-mini @ 2024-09-11 11:43AM:

Stat Average Cold Minimum Maximum
stream rate Avg: 96.160 Cold: 117.2 Min: 79.1 Max: 117.2
latency (s) Avg: 0.327 Cold: 0.3634 Min: 0.2319 Max: 0.3699
total response (s) Avg: 3.039 Cold: 2.5391 Min: 2.5391 Max: 3.5956
total rate Avg: 85.830 Cold: 100.823 Min: 71.198 Max: 100.823
response tokens Avg: 256.000 Cold: 256 Min: 256 Max: 256

I’ll mention an earlier finding and how it holds up on new models: altering temperature and top_p from their defaults slows down token generation. In particular, a ridiculously high temperature constrained by a low top_p drags GPT-4 to a crawl.

temp and top_p trials = [(1, 1), (0, 0), (1.8, 0.1)] on “mini”
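
The sweep itself is just a loop over those sampling pairs, something like this (a sketch, not the actual bench script):

import timeit
from openai import OpenAI

client = OpenAI()

# Sampling pairs under test: defaults, greedy-ish, and high temperature constrained by low top_p.
TRIALS = [(1.0, 1.0), (0.0, 0.0), (1.8, 0.1)]

for temperature, top_p in TRIALS:
    start = timeit.default_timer()
    completion = client.chat.completions.create(
        model='gpt-4o-mini',
        messages=[{'role': 'user', 'content': 'Write a long story.'}],
        max_tokens=512,
        temperature=temperature,
        top_p=top_p,
    )
    elapsed = timeit.default_timer() - start
    tokens = completion.usage.completion_tokens
    print(f"temp={temperature} top_p={top_p}: {tokens} tokens in {elapsed:.2f}s "
          f"({tokens / elapsed:.1f} tok/s)")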

AI analysis:

Based on the provided report detailing trials on the GPT-4O-mini model with varying temperature (temp) and top_p settings, here is a summary and analysis of the AI model operations and speeds:

  • Standard Settings (temp=1, top_p=1):
    • This setting yields the best performance in terms of stream rate and total rate, with averages of 113.320 and 103.161 respectively. It also maintains a moderate average latency of 0.425 seconds.
  • Zero Settings (temp=0, top_p=0):
    • Reducing the temperature and top_p to zero results in a lower stream rate and total rate, with noticeable reductions in efficiency. Average stream rate and total rate are 102.880 and 95.323 respectively, which are lower than the standard settings.
  • High Temp with Low Top_p (temp=1.8, top_p=0.1):
    • This setting shows the lowest performance in terms of stream rate (average 99.040) and total rate (average 89.766). Additionally, it has the highest average latency of 0.526 seconds, indicating slower response times.

Efficiency and Responsiveness:

  • The efficiency (as measured by total rate and stream rate) decreases as the temperature increases and top_p decreases. The settings with temp=1, top_p=1 are optimal for maintaining a balance between speed and output consistency.
  • Latency increases with higher temperature and lower top_p, indicating that the model struggles with token generation under these conditions, leading to slower response times.

Also: GPT-4o inserts its opinion that diversity is good into any sampling discussion, which was not asked for here at all, along with a hallucinated analysis drawn more from training and from mirroring the user prompt, so that part of its output must be rejected:

High Temperature with Low Top_p (Temp=1.8, Top_p=0.1): Provides the greatest token variety but results in significantly slower responses

Data

For 5 trials of gpt-4o-mini, temp=1, topp=1 @ 2024-09-11 12:43PM:

Stat Average Cold Minimum Maximum
stream rate Avg: 113.320 Cold: 126.4 Min: 91.9 Max: 133.3
latency (s) Avg: 0.425 Cold: 0.7399 Min: 0.2651 Max: 0.7399
total response (s) Avg: 5.014 Cold: 4.7829 Min: 4.2667 Max: 5.827
total rate Avg: 103.161 Cold: 107.048 Min: 87.867 Max: 119.999
response tokens Avg: 512.000 Cold: 512 Min: 512 Max: 512

For 5 trials of gpt-4o-mini, temp=0, topp=0 @ 2024-09-11 12:43PM:

Stat Average Cold Minimum Maximum
stream rate Avg: 102.880 Cold: 118.5 Min: 79.2 Max: 118.5
latency (s) Avg: 0.398 Cold: 0.4008 Min: 0.2852 Max: 0.6003
total response (s) Avg: 5.471 Cold: 4.7138 Min: 4.6677 Max: 6.7518
total rate Avg: 95.323 Cold: 108.617 Min: 75.832 Max: 109.69
response tokens Avg: 512.000 Cold: 512 Min: 512 Max: 512

For 5 trials of gpt-4o-mini, temp=1.8, topp=0.1 @ 2024-09-11 12:43PM:

Stat Average Cold Minimum Maximum
stream rate Avg: 99.040 Cold: 106.5 Min: 87.7 Max: 114.8
latency (s) Avg: 0.526 Cold: 0.4165 Min: 0.3544 Max: 0.8488
total response (s) Avg: 5.737 Cold: 5.2154 Min: 5.2154 Max: 6.4038
total rate Avg: 89.766 Cold: 98.171 Min: 79.953 Max: 98.171
response tokens Avg: 512.000 Cold: 512 Min: 512 Max: 512
2 Likes

I am having exactly the same issue, and it has been getting worse over the last 3 days. I have been trying to ask OpenAI support for help, but to no avail.

I’m about to make my app live, and the Assistants API has been really bad; it has shaken my confidence in OpenAI.

This has been the case in the morning (PST) yesterday (9/10) and today (9/11): sometimes streaming wouldn’t even receive a response, and it would take more than 30 seconds on almost every other request.

Sometimes it was taking time in thread creation, but today especially it was taking a long time (20 seconds) for this call:

client.beta.threads.messages.create(
    thread_id=thread_id, role='user', content=user_message)

After this, it would take 30-plus seconds to receive the first chunk. I played around a lot and it still did not work. I posted many messages, and after 4-5 hours it got better. Now, around 1 PM, it’s back to normal speed.

This was the case as early as 6 AM today and yesterday, but it started getting better around 10 AM and improved after 12 PM PST, so I’m guessing maybe some big batch jobs are hogging the servers?
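
To put a number on that time-to-first-chunk, a measurement along these lines can be used with the Assistants streaming helper (a sketch; thread_id and assistant_id are placeholders for an existing thread and assistant):

import timeit
from openai import OpenAI

client = OpenAI()

def time_to_first_event(thread_id: str, assistant_id: str) -> float:
    # Stream a run on an existing thread and report how long the first event takes to arrive.
    start = timeit.default_timer()
    first = None
    with client.beta.threads.runs.stream(
        thread_id=thread_id,
        assistant_id=assistant_id,
    ) as stream:
        for event in stream:
            if first is None:
                first = timeit.default_timer() - start
                print(f"First stream event ({event.event}) after {first:.2f}s")
    return first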