Completions API suddenly slow

We are using the JavaScript OpenAI client from AWS Lambda with the "gpt-4o" model.

The same application request was running in under 5 minutes in a Lambda function (which was our intended architecture) and was working perfectly last Friday (11 Oct 2024). Then suddenly, yesterday (Monday, 14 Oct 2024), there was a severe delay: our requests are timing out, and only intermittently does one or two succeed.
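For context, our call pattern is roughly like the sketch below (handler, prompt, and timeout values are placeholders, assuming the standard `openai` Node SDK). We keep the client timeout below the Lambda limit so a slow request fails fast instead of the function being killed mid-request:

```typescript
// Minimal sketch of the call pattern, assuming the standard "openai" Node SDK.
// Handler name, prompt, and the 4-minute timeout are placeholders; the point is
// keeping the client timeout below the 5-minute Lambda limit.
import OpenAI from "openai";

const client = new OpenAI({
  timeout: 4 * 60 * 1000, // abort the request before the Lambda timeout hits
  maxRetries: 0,          // surface slowness instead of silently retrying
});

export const handler = async () => {
  const started = Date.now();
  const completion = await client.chat.completions.create({
    model: "gpt-4o",
    messages: [{ role: "user", content: "Hello" }], // placeholder prompt
  });
  console.log(`completion took ${Date.now() - started} ms`);
  return completion.choices[0].message.content;
};
```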

No changes were made on our end or to our cloud configuration.

We need to understand how to lodge a support ticket without AI assistance, as the assistant just ends up suggesting your documentation, which doesn't help.

Is there a way to look at your logs and statistics of your response times for our requests? More importantly, we want this fixed. No RPM or TPM limits have been reached, and we have around $1,000 credit on the account, so this is clearly not rate limiting or throttling.
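If it helps triage, here is a hedged sketch of what we can log client-side, assuming the Node SDK's `.withResponse()` helper exposing the raw HTTP response: the `x-request-id` header identifies the request for support, and `openai-processing-ms` reports server-side processing time, which we can compare against our own wall clock:

```typescript
// Sketch: capture request ID and server-side timing headers alongside
// our own wall-clock measurement. Prompt is a placeholder.
import OpenAI from "openai";

const client = new OpenAI();

const started = Date.now();
const { data: completion, response } = await client.chat.completions
  .create({
    model: "gpt-4o",
    messages: [{ role: "user", content: "ping" }], // placeholder prompt
  })
  .withResponse();

console.log({
  requestId: response.headers.get("x-request-id"),
  serverProcessingMs: response.headers.get("openai-processing-ms"),
  wallClockMs: Date.now() - started,
  tokens: completion.usage?.total_tokens,
});
```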

Please help.

To add to this, we have already gone through the latency-optimization documentation: https://platform.openai.com/docs/guides/latency-optimization

That is not the issue, as this was working fine before. We are using your chat completions API via the JavaScript client's "openAIClient.chat.completions.create".

The problem is a sudden, significant drop in performance.

Performance is down even further than six hours ago. Benchmarking again:

For 3 trials of gpt-4o-2024-08-06 @ 2024-10-15 06:21AM:

| Stat | Average | Cold | Minimum | Maximum |
| --- | --- | --- | --- | --- |
| stream rate (tokens/s) | 29.233 | 27.5 | 27.5 | 30.4 |
| latency (s) | 0.679 | 0.6909 | 0.4539 | 0.8909 |
| total response (s) | 18.175 | 19.2412 | 17.5793 | 19.2412 |
| total rate (tokens/s) | 28.217 | 26.61 | 26.61 | 29.125 |
| response tokens | 512.000 | 512 | 512 | 512 |

For 3 trials of gpt-4o-2024-05-13 @ 2024-10-15 06:21AM:

| Stat | Average | Cold | Minimum | Maximum |
| --- | --- | --- | --- | --- |
| stream rate (tokens/s) | 51.333 | 57.5 | 46.7 | 57.5 |
| latency (s) | 0.620 | 0.512 | 0.512 | 0.703 |
| total response (s) | 10.649 | 9.3955 | 9.3955 | 11.5906 |
| total rate (tokens/s) | 48.461 | 54.494 | 44.174 | 54.494 |
| response tokens | 512.000 | 512 | 512 | 512 |

Stream rate, tokens/s (previously → now):

42 → 28 on gpt-4o
85 → 48 on gpt-4o-2024-05-13
27 on gpt-4-turbo
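For anyone wanting to reproduce these numbers, here is a rough sketch of the measurement method (prompt and model list are placeholders): stream a fixed 512-token completion, record the time to the first chunk as latency, and compute the stream rate from the token usage reported on the final chunk.

```typescript
// Rough benchmark sketch: time to first chunk ("latency"), total wall time,
// and tokens/s during streaming. Token counts come from
// stream_options.include_usage, reported on the final chunk.
import OpenAI from "openai";

const client = new OpenAI();

async function benchmark(model: string) {
  const started = Date.now();
  let firstChunkMs = 0;
  let completionTokens = 0;

  const stream = await client.chat.completions.create({
    model,
    messages: [{ role: "user", content: "Write a 500-word story." }], // placeholder
    max_tokens: 512,
    stream: true,
    stream_options: { include_usage: true }, // final chunk carries usage
  });

  for await (const chunk of stream) {
    if (!firstChunkMs) firstChunkMs = Date.now() - started;
    if (chunk.usage) completionTokens = chunk.usage.completion_tokens;
  }

  const totalS = (Date.now() - started) / 1000;
  const latencyS = firstChunkMs / 1000;
  console.log(
    `${model}: latency ${latencyS.toFixed(3)}s, total ${totalS.toFixed(3)}s, ` +
      `stream rate ${(completionTokens / (totalS - latencyS)).toFixed(1)} tok/s`
  );
}

for (const model of ["gpt-4o-2024-08-06", "gpt-4o-2024-05-13"]) {
  await benchmark(model);
}
```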

If you are not using features specific to structured outputs, you could switch to the versioned model that is currently performing better (gpt-4o-2024-05-13).

From ongoing analysis, 6am-9am seems to be the peak slowness window on weekdays, perhaps more so today, with yesterday being a US holiday and everyone getting back to work with their AI questions. You can really see the chunk progress pause and struggle, as though inference is time-slicing between users.

Hopefully the data ops people will be on this.

A week of performance:


Thanks for this; so I know I'm not the only one receiving delayed API responses from 4o with structured outputs: about 4-5 seconds on a small-token request. Have you tested with Azure's API?

There will be a delay in receiving the first token when using structured outputs with a new or changed JSON schema for the first time: up to 10 seconds while a parser index is built, which is then cached. So you will not be the only one, as that is an expected artifact of the technology.
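You can see this first-call penalty in isolation with a sketch like the one below (the schema and prompt are made up): call twice with an identical schema and compare timings; the second call should skip the parser-index build and hit the cache.

```typescript
// Illustrative sketch: the first request with a new JSON schema pays the
// parser-index build; a repeat call with the identical schema should be faster.
import OpenAI from "openai";

const client = new OpenAI();

const responseFormat = {
  type: "json_schema" as const,
  json_schema: {
    name: "answer", // hypothetical schema for illustration
    strict: true,
    schema: {
      type: "object",
      properties: { answer: { type: "string" } },
      required: ["answer"],
      additionalProperties: false,
    },
  },
};

async function timedCall(label: string) {
  const started = Date.now();
  await client.chat.completions.create({
    model: "gpt-4o-2024-08-06",
    messages: [{ role: "user", content: "Say hi as JSON." }], // placeholder
    response_format: responseFormat,
  });
  console.log(`${label}: ${Date.now() - started} ms`);
}

await timedCall("first call (schema compiled)");
await timedCall("second call (schema cached)");
```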

This process, and the cache lookup, is likely backed by different computational resources than the language inference that is reported to be underperforming relative to expectations and past use.