Really slow response time with text completions API today?

I am seeing an average 15-second delay between making an OpenAI text completion call and getting a response from the API. Usually it’s only a second or two. I am making at most one call every 10 seconds or so, so it’s not a usage-throttling issue. I am using this model:

gpt-4-1106-preview

Is anybody else seeing this today? It’s a pretty bad look for a search page like mine. What’s weird is that the delay is almost constant, as if an artificial delay is being added on the API side. If there is a model at the level of gpt-4-1106-preview with a faster response time, please tell me its name.


Here’s the current performance I get with gpt-4-turbo models, directly using their versioned model names instead of the odd series of aliases. No systematic delay.

| Model | Trials | Avg Latency (s) | Avg Rate (tokens/s) |
|---|---|---|---|
| gpt-4-0125-preview | 3 | 0.988 | 33.101 |
| gpt-4-1106-preview | 3 | 0.982 | 37.097 |
| gpt-4-turbo-2024-04-09 | 3 | 1.015 | 39.084 |

You can check whether the situation improves immediately with these alternates. The 1106 model (from 2023) should be skipped because of issues with international character sets. Of these, only gpt-4-turbo-2024-04-09 supports vision.

If you are not streaming, then the “latency” figure (time to first token) isn’t the whole story: you also have to wait for the entire response to be generated at the listed rate before anything comes back.

The size of the input sent can also slightly affect the generation speed.
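If you want to measure this yourself, here’s a minimal timing sketch using the openai Python SDK with streaming, so time-to-first-token and the overall generation rate show up separately. The prompt is just a placeholder, and streamed chunks are only a rough stand-in for tokens:

```python
import time
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def time_model(model: str, prompt: str = "Write three sentences about latency."):
    start = time.perf_counter()
    first_token = None
    chunks = 0

    # Stream so time-to-first-token and total generation time can be separated
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token is None:
                first_token = time.perf_counter() - start
            chunks += 1
    total = time.perf_counter() - start

    # Rough rate: streamed chunks per second after the first token arrives
    rate = chunks / (total - first_token) if first_token and total > first_token else 0
    print(f"{model}: latency {first_token:.3f}s, ~{rate:.1f} chunks/s, total {total:.3f}s")

time_model("gpt-4-1106-preview")
time_model("gpt-4o-2024-11-20")
```

Running a few trials per model and averaging, as in the tables here, smooths out the run-to-run variance.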

You can see if your task can be serviced well by newer “turbo-like” models, with reduced costs.

| Model | Trials | Avg Latency (s) | Avg Rate (tokens/s) |
|---|---|---|---|
| gpt-4o-2024-11-20 | 3 | 0.622 | 62.735 |
| gpt-4.1 | 3 | 1.042 | 51.339 |

These numbers are what I just obtained during nighttime hours for the Western hemisphere…


Just to note: putting the Assistants endpoint in front of a model can add its own delays. It acts as an intermediary between you and the chat completions endpoint, requiring multiple calls and queued processing.


Thanks. I’ll give those other models a try. I don’t stream, because the answer I get back from my prompt is a small JSON object (30 short field values) that I need in its entirety, so there’s no point in streaming. Unless, for some really odd reason, the time to receive the entire answer is actually faster when streamed than when not, which would be rather strange system behavior?

Note: I am not using the Assistants API, just direct API endpoint calls.
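For reference, a non-streaming call of the kind described here, with the model name swapped and the full round trip timed, might look like the sketch below. The prompt and field contents are just illustrative; the JSON-mode response_format is optional but helps keep a small structured reply well-formed:

```python
import json
import time
from openai import OpenAI

client = OpenAI()

start = time.perf_counter()
response = client.chat.completions.create(
    model="gpt-4o-2024-11-20",  # swapped in for the slower preview model
    messages=[
        {"role": "system", "content": "Return the answer as a JSON object."},
        {"role": "user", "content": "Summarize this listing into the agreed fields."},
    ],
    response_format={"type": "json_object"},  # keeps the reply parseable as JSON
)
elapsed = time.perf_counter() - start

data = json.loads(response.choices[0].message.content)
print(f"Round trip: {elapsed:.2f}s, {len(data)} fields")
```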

UPDATE: Thanks Jay! That did it. So that “preview” model is dog slow for some reason. I switched to gpt-4o-2024-11-20 and the response time dropped to about 1.5 seconds.

Any major differences between gpt-4.1 and gpt-4o-2024-11-20 in accuracy (LLM capability) and cost?