I am getting on average a 15 second delay time between making an OpenAI text completion call and getting a response from the API. Usually it’s only a second or two. I am only making one call every 10 seconds or so (at least), so it’s not a usage throttling issue. I am using this model:
gpt-4-1106-preview
Is anybody else seeing this today? It’s a pretty bad look for a search page like mine. What’s weird is that the delay is almost constant, as if an artificial delay is being added on the API side. If there is a model with a faster response time that is at the level of gpt-4-1106-preview, please tell me its name.
Here’s the current performance I get with the gpt-4-turbo models, calling their versioned model names directly instead of the odd series of aliases. There is no systematic delay.
| Model | Trials | Avg Latency (s) | Avg Rate (tokens/s) |
|---|---|---|---|
| gpt-4-0125-preview | 3 | 0.988 | 33.101 |
| gpt-4-1106-preview | 3 | 0.982 | 37.097 |
| gpt-4-turbo-2024-04-09 | 3 | 1.015 | 39.084 |
You can see whether the situation immediately improves with these alternates. The 1106 model (from 2023) should be skipped because of issues with international character sets. Of these, only gpt-4-turbo-2024-04-09 supports vision.
If you are not streaming, then instead of just the “latency” (the time to the first token), you have to wait for the entire response to finish generating at the model’s generation rate.
The size of the input sent can also slightly affect the generation speed.
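If you want to reproduce this kind of measurement yourself, a minimal timing sketch along these lines works (assuming the openai>=1.x Python SDK with OPENAI_API_KEY set; it counts stream chunks as a rough proxy for tokens, and the prompt is just a placeholder):

```python
import time
from openai import OpenAI  # assumes the openai>=1.x Python SDK

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def measure(model: str, prompt: str = "Write a two-sentence greeting.") -> None:
    """Print time-to-first-token and a rough generation rate for one streamed call."""
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    first = None
    chunks = 0
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first is None:
                first = time.perf_counter()  # first content token arrived
            chunks += 1
    end = time.perf_counter()
    if first is None:
        print(f"{model}: no content received")
        return
    latency = first - start
    rate = chunks / (end - first) if end > first else float("nan")
    print(f"{model}: latency {latency:.3f}s, ~{rate:.1f} chunks/s (roughly tokens/s)")

for m in ("gpt-4-0125-preview", "gpt-4-1106-preview", "gpt-4-turbo-2024-04-09"):
    measure(m)
```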
You can also see whether your task can be served well by the newer “turbo-like” models, at reduced cost.
| Model | Trials | Avg Latency (s) | Avg Rate (tokens/s) |
|---|---|---|---|
| gpt-4o-2024-11-20 | 3 | 0.622 | 62.735 |
| gpt-4.1 | 3 | 1.042 | 51.339 |
These figures are what I just obtained during nighttime for the Western Hemisphere…
Just to note: putting the Assistants endpoint in front of the models can add its own delays. It is an intermediary between you and the chat completions endpoint, involving multiple calls and queued processing.
Thanks. I’ll give those other models a try. I don’t stream because the answer I get back from my prompt is a small JSON object (30 short field values) that I need in its entirety, so there’s no point in streaming. Unless, for some really odd reason, the time to receive the entire answer is actually faster when streamed than not, which would be rather strange system behavior?
Note: I am not using the Assistants API, just direct API endpoint calls.
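For context, the shape of my call is roughly this (a minimal sketch using the openai>=1.x Python SDK; the model name, prompts, and field contents are placeholders for my actual ones):

```python
from openai import OpenAI  # assumes the openai>=1.x Python SDK

client = OpenAI()

# Non-streaming: the whole JSON object is needed before anything can be shown,
# so there is nothing to gain from streaming partial output.
response = client.chat.completions.create(
    model="gpt-4-1106-preview",  # the model under discussion; swap per advice above
    response_format={"type": "json_object"},  # ask for a valid JSON object back
    messages=[
        {"role": "system", "content": "Reply only with a JSON object of search fields."},
        {"role": "user", "content": "…the search query goes here…"},
    ],
)
print(response.choices[0].message.content)  # the complete JSON, parsed downstream
```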
UPDATE: Thanks Jay! That did it. So that “preview” model is dog slow for some reason. I switched to gpt-4o-2024-11-20 and the response now takes only about 1.5 seconds.
Any major differences between gpt-4.1 and gpt-4o-2024-11-20 in accuracy (LLM capability) and cost?