We are using the JavaScript OpenAI client from AWS Lambda with the “gpt-4o” model.
The same application request was running in under 5 minutes in a Lambda function (our intended architecture) and was working perfectly last Friday (11th Oct 2024). Then suddenly yesterday, Monday (14th Oct 2024), there is a severe delay: our requests are timing out, and only intermittently does one or two succeed.
No changes were made on our end or to our cloud configuration.
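For context, this is roughly our setup (handler simplified, names illustrative). Setting an explicit client-side timeout below the Lambda limit at least turns the hang into a catchable error instead of a silent function timeout:

```js
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.OPENAI_API_KEY,
  timeout: 240_000, // 4 min, under our 5-min Lambda limit
  maxRetries: 0,    // fail fast so we can log it, rather than retry silently
});

export const handler = async (event) => {
  const started = Date.now();
  try {
    const completion = await client.chat.completions.create({
      model: "gpt-4o",
      messages: [{ role: "user", content: event.prompt }],
    });
    console.log(`OpenAI call took ${Date.now() - started} ms`);
    return completion.choices[0].message.content;
  } catch (err) {
    console.error(`OpenAI call failed after ${Date.now() - started} ms`, err);
    throw err;
  }
};
```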
We need to understand how to lodge a support ticket without AI assistance, which just ends up suggesting your documentation and doesn’t help.
Is there a way to look at your logs and statistics of your response times for our requests? We really just want to get this fixed. No RPM or TPM limits are being reached, and we have around $1,000 of credit on the account, so this is clearly not rate limiting or throttling.
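One thing we can do on our side is capture OpenAI’s response headers per call, which include a request ID and the documented server-side processing time; those should let specific slow requests be quoted in a ticket. A sketch with the Node client:

```js
import OpenAI from "openai";

const client = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// Capture the raw HTTP response alongside the parsed completion so the
// request ID and server-side processing time can be logged per call.
const { data: completion, response } = await client.chat.completions
  .create({
    model: "gpt-4o",
    messages: [{ role: "user", content: "ping" }], // illustrative request
  })
  .withResponse();

console.log("x-request-id:", response.headers.get("x-request-id"));
console.log("openai-processing-ms:", response.headers.get("openai-processing-ms"));
console.log("tokens:", completion.usage?.total_tokens);
```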
Performance is down even further from six hours ago. Benchmarking again (a sketch of the measurement code follows the results):
For 3 trials of gpt-4o-2024-08-06 @ 2024-10-15 06:21AM:

| Stat | Average | Cold | Minimum | Maximum |
|---|---|---|---|---|
| stream rate (tokens/s) | 29.233 | 27.5 | 27.5 | 30.4 |
| latency (s) | 0.679 | 0.6909 | 0.4539 | 0.8909 |
| total response (s) | 18.175 | 19.2412 | 17.5793 | 19.2412 |
| total rate (tokens/s) | 28.217 | 26.61 | 26.61 | 29.125 |
| response tokens | 512.000 | 512 | 512 | 512 |
For 3 trials of gpt-4o-2024-05-13 @ 2024-10-15 06:21AM:

| Stat | Average | Cold | Minimum | Maximum |
|---|---|---|---|---|
| stream rate (tokens/s) | 51.333 | 57.5 | 46.7 | 57.5 |
| latency (s) | 0.620 | 0.512 | 0.512 | 0.703 |
| total response (s) | 10.649 | 9.3955 | 9.3955 | 11.5906 |
| total rate (tokens/s) | 48.461 | 54.494 | 44.174 | 54.494 |
| response tokens | 512.000 | 512 | 512 | 512 |
Total rate, previously → now (tokens/s):

- 42 → 28 on gpt-4o
- 85 → 48 on gpt-4o-2024-05-13
- 27 on gpt-4-turbo
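For anyone who wants to reproduce these numbers, roughly how one trial is measured (prompt is illustrative; “stream rate” is tokens/s after the first token, “total rate” includes the initial latency):

```js
import OpenAI from "openai";

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

// One benchmark trial: time-to-first-token, total wall time, and token rates.
async function trial(model) {
  const t0 = performance.now();
  let tFirst = null;
  let usage = null;

  const stream = await client.chat.completions.create({
    model,
    messages: [{ role: "user", content: "Write a long story." }], // illustrative
    max_tokens: 512,
    stream: true,
    stream_options: { include_usage: true }, // final chunk carries token usage
  });

  for await (const chunk of stream) {
    if (tFirst === null && chunk.choices[0]?.delta?.content) {
      tFirst = performance.now();
    }
    if (chunk.usage) usage = chunk.usage;
  }

  const total = (performance.now() - t0) / 1000;   // total response (s)
  const latency = (tFirst - t0) / 1000;            // latency (s)
  const tokens = usage?.completion_tokens ?? 0;    // response tokens
  return {
    latency,
    total,
    streamRate: tokens / (total - latency),        // tokens/s after first token
    totalRate: tokens / total,                     // tokens/s overall
    tokens,
  };
}

console.log(await trial("gpt-4o-2024-08-06"));
```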
If you are not using features specific to structured outputs, you could switch to the versioned model that is currently performing better, e.g.:
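```js
// One-line change: pin the faster snapshot. Note this older snapshot
// predates structured outputs (json_schema response_format).
const completion = await client.chat.completions.create({
  model: "gpt-4o-2024-05-13",
  messages, // your existing messages array
});
```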
From past continuous analysis, 6am-9am seems to be the peak slowness window on weekdays, perhaps more so today with yesterday being a US holiday and everyone getting back to work with their AI questions. You can really see the chunk progress pause and struggle, as though inference is time-slicing between users.
Thanks for this, so I know I’m not the only one receiving delayed API responses from 4o with structured outputs… about 4-5 seconds on a small token request. Have you tested with Azure’s API?
There will be a delay in receiving the first token when using structured outputs with an original or changed JSON schema for the first time: up to 10 seconds while a parser index is built, which is then cached. So you will not be the only one; that is an expected artifact of the technology.
This build process, and the cache lookup, likely run on different computational resources than language inference itself, which is what is being reported as underperforming relative to expectations and past use.
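If that first-use schema cost is what you’re hitting, one workaround worth trying (assuming the cache persists across requests for an identical schema, as described above) is to prime it at deploy or cold-start time with a tiny request using your exact schema. A sketch with a placeholder schema:

```js
import OpenAI from "openai";

const client = new OpenAI();

// Placeholder for your actual schema; any change to it means a new
// parser index gets built, so prime with exactly what production sends.
const schema = {
  name: "answer",
  strict: true,
  schema: {
    type: "object",
    properties: { answer: { type: "string" } },
    required: ["answer"],
    additionalProperties: false,
  },
};

// Tiny priming request so real traffic hits the cached parser index.
await client.chat.completions.create({
  model: "gpt-4o-2024-08-06",
  messages: [{ role: "user", content: "ping" }],
  response_format: { type: "json_schema", json_schema: schema },
  max_tokens: 16,
});
```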