Completions API suddenly slow

We are using the JavaScript OpenAI client from AWS Lambda with the "gpt-4o" model.

The same application request was running in under 5 minutes in a Lambda function (which was our intended architecture) and was working perfectly last Friday (11 Oct 2024). Then suddenly, yesterday (Monday, 14 Oct 2024), there was a severe delay: our requests are timing out, and only intermittently does one or two succeed.
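For context, our call pattern is roughly like the sketch below (handler, prompt, and timeout values are placeholders, assuming the standard `openai` Node SDK). We keep the client timeout below the Lambda limit so a slow request fails fast instead of the function being killed mid-request:

```typescript
// Minimal sketch of the call pattern, assuming the standard "openai" Node SDK.
// Handler name, prompt, and the 4-minute timeout are placeholders; the point is
// keeping the client timeout below the 5-minute Lambda limit.
import OpenAI from "openai";

const client = new OpenAI({
  timeout: 4 * 60 * 1000, // abort the request before the Lambda timeout hits
  maxRetries: 0,          // surface slowness instead of silently retrying
});

export const handler = async () => {
  const started = Date.now();
  const completion = await client.chat.completions.create({
    model: "gpt-4o",
    messages: [{ role: "user", content: "Hello" }], // placeholder prompt
  });
  console.log(`completion took ${Date.now() - started} ms`);
  return completion.choices[0].message.content;
};
```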

No changes were made on our end or to our cloud configuration.

We need to understand how to lodge a support ticket without AI assistance, as the assistant just ends up suggesting your documentation, which doesn't help.

Is there a way to look at your logs and statistics of your response times for our requests? More importantly, we want this fixed. No RPM or TPM limits have been reached, and we have around $1,000 credit on the account, so this is clearly not rate limiting or throttling.
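If it helps triage, here is a hedged sketch of what we can log client-side, assuming the Node SDK's `.withResponse()` helper exposing the raw HTTP response: the `x-request-id` header identifies the request for support, and `openai-processing-ms` reports server-side processing time, which we can compare against our own wall clock:

```typescript
// Sketch: capture request ID and server-side timing headers alongside
// our own wall-clock measurement. Prompt is a placeholder.
import OpenAI from "openai";

const client = new OpenAI();

const started = Date.now();
const { data: completion, response } = await client.chat.completions
  .create({
    model: "gpt-4o",
    messages: [{ role: "user", content: "ping" }], // placeholder prompt
  })
  .withResponse();

console.log({
  requestId: response.headers.get("x-request-id"),
  serverProcessingMs: response.headers.get("openai-processing-ms"),
  wallClockMs: Date.now() - started,
  tokens: completion.usage?.total_tokens,
});
```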

Please help.

To add to this, we have already gone through the latency-optimization documentation: https://platform.openai.com/docs/guides/latency-optimization

That is not the issue, as this was working fine before. We are using your chat completions API via the JavaScript client's "openAIClient.chat.completions.create".

The problem is a sudden, significant drop in performance.

Performance is down even further than six hours ago. Benchmarking again:

For 3 trials of gpt-4o-2024-08-06 @ 2024-10-15 06:21AM:

| Stat | Average | Cold | Minimum | Maximum |
| --- | --- | --- | --- | --- |
| stream rate (tokens/s) | 29.233 | 27.5 | 27.5 | 30.4 |
| latency (s) | 0.679 | 0.6909 | 0.4539 | 0.8909 |
| total response (s) | 18.175 | 19.2412 | 17.5793 | 19.2412 |
| total rate (tokens/s) | 28.217 | 26.61 | 26.61 | 29.125 |
| response tokens | 512.000 | 512 | 512 | 512 |

For 3 trials of gpt-4o-2024-05-13 @ 2024-10-15 06:21AM:

| Stat | Average | Cold | Minimum | Maximum |
| --- | --- | --- | --- | --- |
| stream rate (tokens/s) | 51.333 | 57.5 | 46.7 | 57.5 |
| latency (s) | 0.620 | 0.512 | 0.512 | 0.703 |
| total response (s) | 10.649 | 9.3955 | 9.3955 | 11.5906 |
| total rate (tokens/s) | 48.461 | 54.494 | 44.174 | 54.494 |
| response tokens | 512.000 | 512 | 512 | 512 |

Stream rate, tokens/s (previously → now):

42 → 28 on gpt-4o
85 → 48 on gpt-4o-2024-05-13
27 on gpt-4-turbo
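For anyone wanting to reproduce these numbers, here is a rough sketch of the measurement method (prompt and model list are placeholders): stream a fixed 512-token completion, record the time to the first chunk as latency, and compute the stream rate from the token usage reported on the final chunk.

```typescript
// Rough benchmark sketch: time to first chunk ("latency"), total wall time,
// and tokens/s during streaming. Token counts come from
// stream_options.include_usage, reported on the final chunk.
import OpenAI from "openai";

const client = new OpenAI();

async function benchmark(model: string) {
  const started = Date.now();
  let firstChunkMs = 0;
  let completionTokens = 0;

  const stream = await client.chat.completions.create({
    model,
    messages: [{ role: "user", content: "Write a 500-word story." }], // placeholder
    max_tokens: 512,
    stream: true,
    stream_options: { include_usage: true }, // final chunk carries usage
  });

  for await (const chunk of stream) {
    if (!firstChunkMs) firstChunkMs = Date.now() - started;
    if (chunk.usage) completionTokens = chunk.usage.completion_tokens;
  }

  const totalS = (Date.now() - started) / 1000;
  const latencyS = firstChunkMs / 1000;
  console.log(
    `${model}: latency ${latencyS.toFixed(3)}s, total ${totalS.toFixed(3)}s, ` +
      `stream rate ${(completionTokens / (totalS - latencyS)).toFixed(1)} tok/s`
  );
}

for (const model of ["gpt-4o-2024-08-06", "gpt-4o-2024-05-13"]) {
  await benchmark(model);
}
```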

If you are not using features specific to structured outputs, you could switch to the versioned model that is currently performing better (gpt-4o-2024-05-13).

From ongoing analysis, 6am-9am seems to be the peak slowness window on weekdays, perhaps more so today, with yesterday being a US holiday and everyone getting back to work with their AI questions. You can really see the chunk progress pause and struggle, as though inference is time-slicing between users.

Hopefully the data ops people will be on this.

A week of performance:


Thanks for this; so I know I'm not the only one receiving delayed API responses from 4o with structured outputs: about 4-5 seconds on a small-token request. Have you tested with Azure's API?

There will be a delay in receiving the first token when using structured outputs with a new or changed JSON schema for the first time: up to 10 seconds while a parser index is built, which is then cached. So you will not be the only one, as that is an expected artifact of the technology.
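You can see this first-call penalty in isolation with a sketch like the one below (the schema and prompt are made up): call twice with an identical schema and compare timings; the second call should skip the parser-index build and hit the cache.

```typescript
// Illustrative sketch: the first request with a new JSON schema pays the
// parser-index build; a repeat call with the identical schema should be faster.
import OpenAI from "openai";

const client = new OpenAI();

const responseFormat = {
  type: "json_schema" as const,
  json_schema: {
    name: "answer", // hypothetical schema for illustration
    strict: true,
    schema: {
      type: "object",
      properties: { answer: { type: "string" } },
      required: ["answer"],
      additionalProperties: false,
    },
  },
};

async function timedCall(label: string) {
  const started = Date.now();
  await client.chat.completions.create({
    model: "gpt-4o-2024-08-06",
    messages: [{ role: "user", content: "Say hi as JSON." }], // placeholder
    response_format: responseFormat,
  });
  console.log(`${label}: ${Date.now() - started} ms`);
}

await timedCall("first call (schema compiled)");
await timedCall("second call (schema cached)");
```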

This process, and the cache lookup, is likely backed by different computational resources than the language inference that is reported to be underperforming relative to expectations and past use.