2025-05-19 17:09:32,904 - INFO - event_handler.py - [_handle_run_created] Run 'run_GeouQ7Kxwq4KAWuKYf3ze8ix' created for thread 'thread_pg3RT23KONlxNsJuDxTm7c3i'.
2025-05-19 17:09:36,648 - INFO - event_handler.py - [_handle_run_queued] Run queued for thread 'thread_pg3RT23KONlxNsJuDxTm7c3i'.
2025-05-19 17:10:28,572 - ERROR - event_handler.py - [start_run] Error during run: The server had an error processing your request. Sorry about that! You can retry your request, or contact us through our help center at help.openai.com if you keep seeing this error. (Please include the request ID req_9fabd5879874b0985105e99f395fd40d in your email.)
Traceback (most recent call last):
  File "/home/razvansavin/Proiecte/flexiai-toolsmith/flexiai/core/handlers/event_handler.py", line 73, in start_run
    async for event in run_stream:
    ...<4 lines>...
    self.event_dispatcher.dispatch(etype, event, thread_id)
  File "/home/razvansavin/miniconda3/envs/.conda_flexiai/lib/python3.13/site-packages/openai/_streaming.py", line 147, in __aiter__
    async for item in self._iterator:
    yield item
  File "/home/razvansavin/miniconda3/envs/.conda_flexiai/lib/python3.13/site-packages/openai/_streaming.py", line 193, in __stream__
    raise APIError(
    ...<3 lines>...
    )
openai.APIError: The server had an error processing your request. Sorry about that! You can retry your request, or contact us through our help center at help.openai.com if you keep seeing this error. (Please include the request ID req_9fabd5879874b0985105e99f395fd40d in your email.)
2025-05-19 17:10:28,575 - DEBUG - _trace.py - response_closed.started
2025-05-19 17:10:28,575 - DEBUG - _trace.py - receive_response_body.failed exception=GeneratorExit()
2025-05-19 17:10:28,576 - DEBUG - _trace.py - response_closed.complete
A first step is to write code that accurately reports 500-status errors like the one you received, plus error-handling logic that retries a few times, while failing fast on 404 or 429 errors of the “bad model ID” or “not paying your bill” variety.
You might also check how portable your application is to gpt-4.1-mini. It outperforms gpt-4o-mini, except for its tendency to write more than 1,500 tokens.
Here’s the performance of each right now, with all models launched in parallel via asyncio (a sketch of the harness follows the tables). Note that it is nighttime, between 7 am in London and midnight in California.
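Something along the lines of this minimal sketch would do it, assuming the current openai Python SDK (>= 1.x); the function name, retry counts, and backoff are my own illustration, not your code:

```python
# Hedged sketch: retry transient server errors, fail fast on non-retryable statuses.
import asyncio
import logging

import openai

client = openai.AsyncOpenAI()

# Per the advice above: bad model ID (404) or quota/billing (429) won't improve on retry.
NON_RETRYABLE_STATUS = {404, 429}

async def chat_with_retry(messages, model="gpt-4.1-mini", max_attempts=3):
    for attempt in range(1, max_attempts + 1):
        try:
            return await client.chat.completions.create(model=model, messages=messages)
        except openai.APIStatusError as e:
            # An HTTP response came back with an error status code attached.
            logging.error("HTTP %s on attempt %d: %s", e.status_code, attempt, e.message)
            if e.status_code in NON_RETRYABLE_STATUS or attempt == max_attempts:
                raise
        except openai.APIError as e:
            # Covers mid-stream 500-style failures like the one in your traceback.
            logging.error("APIError on attempt %d: %s", attempt, e)
            if attempt == max_attempts:
                raise
        await asyncio.sleep(2 ** attempt)  # simple exponential backoff
```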
1024 max tokens

| Model | Trials | Avg Latency (s) | Avg Rate (tokens/s) |
|---|---|---|---|
| gpt-4o-mini | 10 | 0.938 | 36.083 |
| gpt-4.1-mini | 10 | 0.749 | 72.579 |

512 max tokens

| Model | Trials | Avg Latency (s) | Avg Rate (tokens/s) |
|---|---|---|---|
| gpt-4o-mini | 3 | 0.704 | 49.785 |
| gpt-4.1-mini | 3 | 0.753 | 62.024 |

128 max tokens

| Model | Trials | Avg Latency (s) | Avg Rate (tokens/s) |
|---|---|---|---|
| gpt-4o-mini | 3 | 0.812 | 38.981 |
| gpt-4.1-mini | 3 | 0.682 | 59.107 |
(Prompt caching is defeated by a varying nonce at token position 0.)
And then, five hours later, the generation rates of the models have reversed, nullifying any recommendation.
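For reference, here is a rough sketch of the kind of asyncio harness behind the numbers above. It is illustrative only: the prompt, the chunk-as-token counting, and the exact timing points are assumptions, but it shows the parallel launches, the cache-breaking nonce at message position 0, and the non-default top_p discussed further down.

```python
import asyncio
import time
import uuid

import openai

client = openai.AsyncOpenAI()
MODELS = ["gpt-4o-mini", "gpt-4.1-mini"]

async def one_trial(model: str, max_tokens: int) -> tuple[float, float]:
    nonce = str(uuid.uuid4())
    messages = [
        # A varying nonce at position 0 breaks prompt-cache reuse between trials.
        {"role": "system", "content": f"{nonce} is chat session ID."},
        {"role": "user", "content": "Write an essay about latency benchmarking."},
    ]
    start = time.monotonic()
    stream = await client.chat.completions.create(
        model=model, messages=messages, max_tokens=max_tokens, stream=True, top_p=0.001
    )
    first = None
    chunks = 0
    async for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first is None:
                first = time.monotonic()  # time to first token = "latency"
            chunks += 1
    end = time.monotonic()
    latency = (first or end) - start
    rate = chunks / max(end - (first or start), 1e-6)  # rough tokens/s, 1 chunk ≈ 1 token
    return latency, rate

async def main(max_tokens: int = 1024, trials: int = 10):
    # Launch every trial for every model in parallel, then average per model.
    jobs = [(m, one_trial(m, max_tokens)) for m in MODELS for _ in range(trials)]
    results = await asyncio.gather(*(coro for _, coro in jobs))
    for model in MODELS:
        rows = [r for (m, _), r in zip(jobs, results) if m == model]
        print(f"{model}: avg latency {sum(r[0] for r in rows)/len(rows):.3f} s, "
              f"avg rate {sum(r[1] for r in rows)/len(rows):.3f} tokens/s")

asyncio.run(main())
```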
1024 max tokens before

| Model | Trials | Avg Latency (s) | Avg Rate (tokens/s) |
|---|---|---|---|
| gpt-4o-mini | 10 | 0.891 | 43.081 |
| gpt-4.1-mini | 10 | 0.683 | 65.232 |

1024 max tokens now

| Model | Trials | Avg Latency (s) | Avg Rate (tokens/s) |
|---|---|---|---|
| gpt-4o-mini | 10 | 0.951 | 51.417 |
| gpt-4.1-mini | 10 | 1.084 | 37.569 |

768 max tokens now

| Model | Trials | Avg Latency (s) | Avg Rate (tokens/s) |
|---|---|---|---|
| gpt-4o-mini | 10 | 0.759 | 53.049 |
| gpt-4.1-mini | 10 | 0.751 | 39.203 |
If you’re curious: this blast of calls disrupts any cache by inserting an initial random system message of the form “{session_id} is chat session ID.”, which differs not just in content but in token length. The requests are not large enough to receive a cache discount; however, my prior statistical distributions showed a performance difference correlated with cacheability even when no discount applied.
top_p: 0.001 reflects the typical developer’s desire to control sampling; note that departing from the defaults for temperature or top_p also affects performance.
Another idea: don’t route your calls through Assistants as a middleman, which adds its own variable delays. Multiple API calls are required to set it up and run, with little benefit beyond its internal tools and thread reuse (which reduces the network transmission of historic messages).
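As a rough illustration (model name, prompt, and history handling are placeholders), calling Chat Completions directly and carrying the conversation yourself is a single request per turn:

```python
# Hedged sketch: skip the Assistants thread/run flow and keep your own "thread".
import openai

client = openai.OpenAI()
history = [{"role": "system", "content": "You are a helpful assistant."}]

def ask(user_text: str) -> str:
    history.append({"role": "user", "content": user_text})
    resp = client.chat.completions.create(model="gpt-4.1-mini", messages=history)
    answer = resp.choices[0].message.content
    history.append({"role": "assistant", "content": answer})  # reuse on the next turn
    return answer

print(ask("Hello!"))
```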