I am getting on average a 15 second delay time between making an OpenAI text completion call and getting a response from the API. Usually it’s only a second or two. I am only making one call every 10 seconds or so (at least), so it’s not a usage throttling issue. I am using this model:
gpt-4-1106-preview
Is anybody else seeing this today? It’s a pretty bad look for a search page like mine. What’s weird is that the delay is almost constant, as if an artificial delay is being added on the API side. If there is a model with a faster response time that is at the level of gpt-4-1106-preview, please tell me its name.
Here’s the current performance I get with the gpt-4-turbo models, calling their versioned model names directly instead of the odd series of aliases. There is no systematic delay.
| Model | Trials | Avg Latency (s) | Avg Rate (tokens/s) |
|---|---|---|---|
| gpt-4-0125-preview | 3 | 0.988 | 33.101 |
| gpt-4-1106-preview | 3 | 0.982 | 37.097 |
| gpt-4-turbo-2024-04-09 | 3 | 1.015 | 39.084 |
You can see whether the situation immediately improves with these alternates. The 1106 model (from 2023) should be skipped because of issues with international character sets. Of these, only gpt-4-turbo-2024-04-09 supports vision.
If you are not streaming, then instead of just the “latency” (the time to the first token), you have to wait for the entire response to finish generating at the model’s generation rate.
The size of the input sent can also slightly affect the generation speed.
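If you want to reproduce this kind of measurement yourself, a minimal timing sketch along these lines works (assuming the openai>=1.x Python SDK with OPENAI_API_KEY set; it counts stream chunks as a rough proxy for tokens, and the prompt is just a placeholder):

```python
import time
from openai import OpenAI  # assumes the openai>=1.x Python SDK

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def measure(model: str, prompt: str = "Write a two-sentence greeting.") -> None:
    """Print time-to-first-token and a rough generation rate for one streamed call."""
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
    )
    first = None
    chunks = 0
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            if first is None:
                first = time.perf_counter()  # first content token arrived
            chunks += 1
    end = time.perf_counter()
    if first is None:
        print(f"{model}: no content received")
        return
    latency = first - start
    rate = chunks / (end - first) if end > first else float("nan")
    print(f"{model}: latency {latency:.3f}s, ~{rate:.1f} chunks/s (roughly tokens/s)")

for m in ("gpt-4-0125-preview", "gpt-4-1106-preview", "gpt-4-turbo-2024-04-09"):
    measure(m)
```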
You can also see whether your task can be served well by the newer “turbo-like” models, at reduced cost.
| Model | Trials | Avg Latency (s) | Avg Rate (tokens/s) |
|---|---|---|---|
| gpt-4o-2024-11-20 | 3 | 0.622 | 62.735 |
| gpt-4.1 | 3 | 1.042 | 51.339 |
These figures are what I just obtained during nighttime for the Western Hemisphere…
Just to note: putting the Assistants endpoint in front of the models can add its own delays. It is an intermediary between you and the chat completions endpoint, involving multiple calls and queued processing.
Thanks. I’ll give those other models a try. I don’t stream because the answer I get back from my prompt is a small JSON object (30 short field values) that I need in its entirety, so there’s no point in streaming. Unless, for some really odd reason, the time to receive the entire answer is actually faster when streamed than not, which would be rather strange system behavior?
Note: I am not using the Assistants API, just direct API endpoint calls.
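For context, the shape of my call is roughly this (a minimal sketch using the openai>=1.x Python SDK; the model name, prompts, and field contents are placeholders for my actual ones):

```python
from openai import OpenAI  # assumes the openai>=1.x Python SDK

client = OpenAI()

# Non-streaming: the whole JSON object is needed before anything can be shown,
# so there is nothing to gain from streaming partial output.
response = client.chat.completions.create(
    model="gpt-4-1106-preview",  # the model under discussion; swap per advice above
    response_format={"type": "json_object"},  # ask for a valid JSON object back
    messages=[
        {"role": "system", "content": "Reply only with a JSON object of search fields."},
        {"role": "user", "content": "…the search query goes here…"},
    ],
)
print(response.choices[0].message.content)  # the complete JSON, parsed downstream
```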
UPDATE: Thanks Jay! That did it. So that “preview” model is dog slow for some reason. I switched to gpt-4o-2024-11-20 and the response now takes only about 1.5 seconds.
Any major differences between gpt-4.1 and gpt-4o-2024-11-20 in accuracy (LLM capability) and cost?