First, I encountered lag, and then an API error occurred

Here are my logs:

2025-05-19 17:09:32,904 - INFO - event_handler.py - [_handle_run_created] Run 'run_GeouQ7Kxwq4KAWuKYf3ze8ix' created for thread 'thread_pg3RT23KONlxNsJuDxTm7c3i'.
2025-05-19 17:09:36,648 - INFO - event_handler.py - [_handle_run_queued] Run queued for thread 'thread_pg3RT23KONlxNsJuDxTm7c3i'.
2025-05-19 17:10:28,572 - ERROR - event_handler.py - [start_run] Error during run: The server had an error processing your request. Sorry about that! You can retry your request, or contact us through our help center at help.openai.com if you keep seeing this error. (Please include the request ID req_9fabd5879874b0985105e99f395fd40d in your email.)
Traceback (most recent call last):
  File "/home/razvansavin/Proiecte/flexiai-toolsmith/flexiai/core/handlers/event_handler.py", line 73, in start_run
    async for event in run_stream:
    ...<4 lines>...
        self.event_dispatcher.dispatch(etype, event, thread_id)
  File "/home/razvansavin/miniconda3/envs/.conda_flexiai/lib/python3.13/site-packages/openai/_streaming.py", line 147, in __aiter__
    async for item in self._iterator:
        yield item
  File "/home/razvansavin/miniconda3/envs/.conda_flexiai/lib/python3.13/site-packages/openai/_streaming.py", line 193, in __stream__
    raise APIError(
    ...<3 lines>...
    )
openai.APIError: The server had an error processing your request. Sorry about that! You can retry your request, or contact us through our help center at help.openai.com if you keep seeing this error. (Please include the request ID req_9fabd5879874b0985105e99f395fd40d in your email.)
2025-05-19 17:10:28,575 - DEBUG - _trace.py - response_closed.started
2025-05-19 17:10:28,575 - DEBUG - _trace.py - receive_response_body.failed exception=GeneratorExit()
2025-05-19 17:10:28,576 - DEBUG - _trace.py - response_closed.complete


The first step is to write some code that accurately reports 500-status errors like the one you received, plus error-handling logic that retries a few times when the error is not of the 404 ("bad model ID") or 429 ("not paying your bill") variety… a sketch follows the link below.

https://platform.openai.com/docs/guides/error-codes
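Something along these lines would do it. This is a minimal sketch assuming the openai>=1.x Python SDK, shown against Chat Completions for brevity (the same pattern can wrap an Assistants run stream). The function name, backoff constants, and print-based reporting are illustrative, not taken from your code:

```python
import asyncio

import openai
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def create_with_retry(max_retries: int = 3, base_delay: float = 2.0, **kwargs):
    """Report failures with their status, retrying only the transient ones."""
    last_exc: Exception | None = None
    for attempt in range(max_retries + 1):
        try:
            return await client.chat.completions.create(**kwargs)
        except openai.APIStatusError as e:
            # 4xx responses such as 404 (bad model ID) or 429 (rate limit /
            # quota) will not be fixed by retrying: report and re-raise.
            if e.status_code < 500:
                print(f"client error {e.status_code}: {e}")
                raise
            print(f"server error {e.status_code}, attempt {attempt + 1}: {e}")
            last_exc = e
        except openai.APIError as e:
            # Connection failures, or 500-class errors surfaced mid-stream
            # without a status code, like the APIError in the traceback above.
            print(f"transient error, attempt {attempt + 1}: {e}")
            last_exc = e
        if attempt < max_retries:
            await asyncio.sleep(base_delay * 2 ** attempt)  # exponential backoff
    raise last_exc
```

The SDK also exposes a `max_retries` client option that handles some of this automatically, but wrapping the call yourself lets you log exactly what happened and decide per-status what deserves another attempt.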

Then, what model is under discussion?

I used gpt-4o-mini and still have $10.
EDIT: Tested again and I see improvements.

You might see how portable your application is to gpt-4.1-mini. It outperforms gpt-4o-mini, except for its tendency to want to write over 1,500 tokens.

Here’s the performance of each right now, with all models launched in parallel via asyncio. And it is night-time: somewhere between 7 a.m. in London and midnight in California.

1024 max tokens

| Model | Trials | Avg Latency (s) | Avg Rate (tokens/s) |
|---|---|---|---|
| gpt-4o-mini | 10 | 0.938 | 36.083 |
| gpt-4.1-mini | 10 | 0.749 | 72.579 |

512 max tokens

| Model | Trials | Avg Latency (s) | Avg Rate (tokens/s) |
|---|---|---|---|
| gpt-4o-mini | 3 | 0.704 | 49.785 |
| gpt-4.1-mini | 3 | 0.753 | 62.024 |

128 max tokens

| Model | Trials | Avg Latency (s) | Avg Rate (tokens/s) |
|---|---|---|---|
| gpt-4o-mini | 3 | 0.812 | 38.981 |
| gpt-4.1-mini | 3 | 0.682 | 59.107 |

(Prompt caching is deliberately broken by a varying nonce at token position 0; a rough sketch of the harness follows.)
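For context, here is roughly the kind of harness that produces numbers like these. It is a sketch, not my exact script: the model names, the parallel asyncio launch, and the per-model averages match what is described above, while the prompt text, chunk-counting as a token proxy, and helper names are illustrative:

```python
import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI()

async def timed_stream(model: str, messages: list[dict], max_tokens: int) -> tuple[float, float]:
    """Stream one completion; return (first-token latency, generation rate)."""
    t0 = time.perf_counter()
    first = None
    pieces = 0
    stream = await client.chat.completions.create(
        model=model, messages=messages, max_tokens=max_tokens, stream=True
    )
    async for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first is None:
                first = time.perf_counter()  # latency: time to first content
            pieces += 1
    end = time.perf_counter()
    rate = (pieces - 1) / (end - first) if pieces > 1 else 0.0  # chunks approximate tokens
    return (first or end) - t0, rate

async def bench(models: list[str], trials: int, max_tokens: int) -> None:
    prompt = [{"role": "user", "content": "Write a long essay about the weather."}]
    tasks = [timed_stream(m, prompt, max_tokens) for m in models for _ in range(trials)]
    results = await asyncio.gather(*tasks)  # every trial of every model in flight at once
    for i, model in enumerate(models):
        rows = results[i * trials:(i + 1) * trials]
        print(f"{model}: avg latency {sum(r[0] for r in rows) / trials:.3f} s, "
              f"avg rate {sum(r[1] for r in rows) / trials:.3f} tokens/s")

asyncio.run(bench(["gpt-4o-mini", "gpt-4.1-mini"], trials=10, max_tokens=1024))
```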


And then, five hours later, the generation rates of the models have reversed… nullifying any recommendation.

1024 max tokens before

| Model | Trials | Avg Latency (s) | Avg Rate (tokens/s) |
|---|---|---|---|
| gpt-4o-mini | 10 | 0.891 | 43.081 |
| gpt-4.1-mini | 10 | 0.683 | 65.232 |

1024 max tokens now

| Model | Trials | Avg Latency (s) | Avg Rate (tokens/s) |
|---|---|---|---|
| gpt-4o-mini | 10 | 0.951 | 51.417 |
| gpt-4.1-mini | 10 | 1.084 | 37.569 |

768 max tokens now

| Model | Trials | Avg Latency (s) | Avg Rate (tokens/s) |
|---|---|---|---|
| gpt-4o-mini | 10 | 0.759 | 53.049 |
| gpt-4.1-mini | 10 | 0.751 | 39.203 |

If you're curious: this blast of calls has any prompt cache disrupted by an initial inserted random system message, “{session_id} is chat session ID.”, which differs not just in content but in token length. The request is not large enough to receive a cached-input discount; however, my earlier statistical distributions found that even when no discount applied, there was a performance difference correlated with cacheability.

top_p: 0.001 reflects the typical desire of a developer to control sampling; note that departing from the defaults for temperature or top_p can itself affect performance. A sketch of both settings follows.
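To make those two points concrete, here is how such a request might be assembled; `build_messages` is a hypothetical helper, and the length range of the session ID is illustrative:

```python
import secrets

def build_messages(user_text: str) -> list[dict]:
    # A random session ID of random length goes at position 0, so both the
    # content and the token count of the first message vary per request,
    # defeating any prompt-cache reuse across trials.
    session_id = secrets.token_hex(secrets.randbelow(8) + 4)
    return [
        {"role": "system", "content": f"{session_id} is chat session ID."},
        {"role": "user", "content": user_text},
    ]

# Sampling pinned near-deterministic and held constant across models and
# trials, since departing from default temperature/top_p can itself shift
# the measurements.
request_kwargs = {"top_p": 0.001, "max_tokens": 1024}
```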

(Sample generations omitted here: two long pyramids of random digits produced by the benchmark runs.)

Thank you, Jay, you gave me some ideas :heart:
hug-hugs


Another idea: don’t pass your call through the Assistants API as a middleman, with its own delays that can vary. Multiple calls are required to set it up and run, with little benefit beyond its internal tools and thread reuse, which reduces the network transmission of historic messages.
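A rough sketch of what the direct route looks like, assuming the openai>=1.x Python SDK: you keep the history yourself and send it with each call. The model choice and message contents here are purely illustrative.

```python
from openai import AsyncOpenAI

client = AsyncOpenAI()

# One Chat Completions request replaces the Assistants sequence of
# create-thread, add-message, create-run, then stream run events.
history: list[dict] = [{"role": "system", "content": "You are a helpful assistant."}]

async def ask(user_text: str) -> str:
    history.append({"role": "user", "content": user_text})
    resp = await client.chat.completions.create(
        model="gpt-4o-mini", messages=history, max_tokens=1024
    )
    reply = resp.choices[0].message.content
    history.append({"role": "assistant", "content": reply})
    return reply
```

The trade-off is network payload: without server-side threads you re-send the conversation on every call, which is where Assistants thread reuse has its one real benefit.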
