After some testing, it looks like there is some kind of bug on OpenAI’s end.
This is the simplest tool call I could think of to generate using a Run:
def run_with_empty_thread(config: TestConfig) -> TestResult:
    client = get_client()
    start_time = timeit.default_timer()
    # with time_block("run_with_empty_thread", print_output=False) as exec_time:
    with time_block('client.beta.threads.create()'):
        thread = client.beta.threads.create()
    with time_block('client.beta.threads.runs.create_and_poll()'):
        run = client.beta.threads.runs.create_and_poll(
            assistant_id=config.assistant_id,
            thread_id=thread.id,
            model=config.model,
            additional_instructions=config.prompt,
            additional_messages=[{'role': 'user', 'content': config.query}],
            tools=config.tools
        )
    if run.status == 'completed':  # will only complete immediately if there are no tools
        with time_block('client.beta.threads.messages.list()'):
            messages = client.beta.threads.messages.list(thread_id=thread.id, limit=1)
        exec_time = timeit.default_timer() - start_time
        result = TestResult(
            completion_time=exec_time,
            tokens_total=run.usage.total_tokens,
            tokens_prompt=run.usage.prompt_tokens,
            tokens_completion=run.usage.completion_tokens,
            tools=[],
            response=messages.data[0].content[0].text.value
        )
        return result
    elif run.status == 'requires_action':  # when tools are required, we have to submit a second run
        tool_outputs = []  # fake the output, we only care about openai speed
        tools = run.required_action.submit_tool_outputs.tool_calls
        for tool_call in tools:
            tool_outputs.append({'tool_call_id': tool_call.id, 'output': 'success'})  # args are: tool_call.function.arguments
        with time_block('client.beta.threads.runs.create_and_poll() - with tool_outputs'):
            run = client.beta.threads.runs.submit_tool_outputs_and_poll(
                run_id=run.id,
                thread_id=run.thread_id,
                tool_outputs=tool_outputs
            )
        if run.status == 'completed':
            with time_block('client.beta.threads.messages.list()'):
                messages = client.beta.threads.messages.list(thread_id=thread.id, limit=1)
            exec_time = timeit.default_timer() - start_time
            result = TestResult(
                completion_time=exec_time,
                tokens_total=run.usage.total_tokens,
                tokens_prompt=run.usage.prompt_tokens,
                tokens_completion=run.usage.completion_tokens,
                tools=tools,
                response=messages.data[0].content[0].text.value
            )
            return result
        else:
            print(f"Unexpected run status: [{run.status}]")
    else:
        print(f"Unexpected run status: [{run.status}]")
On average, this is about a 5 second process:
Execution time of block[client.beta.threads.create()]: 0.200420 seconds
Execution time of block[client.beta.threads.runs.create_and_poll()]: 2.262922 seconds
Execution time of block[client.beta.threads.runs.create_and_poll() - with tool_outputs]: 1.765574 seconds
Execution time of block[client.beta.threads.messages.list()]: 1.017133 seconds
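(time_block is my own helper; a minimal sketch of it, assuming it only measures wall-clock time and prints in the format above, would be something like this. The commented-out variant in the function also captures the elapsed time via `as exec_time`, which this sketch doesn’t implement.)

import timeit
from contextlib import contextmanager

@contextmanager
def time_block(name: str, print_output: bool = True):
    # time the body of the `with` block and print the elapsed seconds
    start = timeit.default_timer()
    try:
        yield
    finally:
        elapsed = timeit.default_timer() - start
        if print_output:
            print(f"Execution time of block[{name}]: {elapsed:.6f} seconds")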
However, once in a while it takes forever. (This screenshot is from before I started benchmarking each individual client call.)
A 30 second wait is obviously a no-go for user experience.
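To quantify how often this happens, I’ve been repeating the same call and looking at the latency distribution; something like this sketch is enough to catch the outliers (not my exact harness):

import statistics

def benchmark(config: TestConfig, n: int = 20) -> None:
    # repeat the same run and summarize the latency distribution;
    # the occasional outlier is what blows up the average
    times = sorted(run_with_empty_thread(config).completion_time for _ in range(n))
    print(f"min={times[0]:.2f}s  median={statistics.median(times):.2f}s  "
          f"mean={statistics.fmean(times):.2f}s  max={times[-1]:.2f}s")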
Edit:
I also wanted to ask if it’s expected for a Run operation (not including the tool submission) to take 5x longer than a chat completion on average. In my testing, a chat completion with the same tool, prompt, and query takes 0.5-0.7 seconds. I can’t find anything in the documentation about an expected speed difference.
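For context, the chat-completion baseline I’m comparing against is roughly this (a sketch, assuming config.tools is already in the function-tool format both APIs accept):

def completion_baseline(config: TestConfig) -> str:
    # same system prompt, user query, and tool definitions as the Run test
    client = get_client()
    with time_block('client.chat.completions.create()'):
        completion = client.chat.completions.create(
            model=config.model,
            messages=[
                {'role': 'system', 'content': config.prompt},
                {'role': 'user', 'content': config.query},
            ],
            tools=config.tools,
        )
    choice = completion.choices[0]
    if choice.message.tool_calls:  # model chose to call the tool
        return choice.message.tool_calls[0].function.name
    return choice.message.content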