Hi everyone,
I’m seeing a big gap between the expected and actual time to first token when using GPT-4.1 through the OpenAI Agents SDK.
From the docs and community discussions, TTFT should be around 400 ms. But in our case:
- Most requests take 2–3 seconds to the first token
- Many queries take 40+ seconds
- Some go even longer than that
I’m not sure if we’re missing something in our setup or if this is a broader issue. The slow responses are starting to frustrate our users, so I’d really appreciate any guidance.
Has anyone else faced this?
Could it be related to the SDK, configuration, streaming, or something else entirely?
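For reference, this is roughly how we're measuring it: a stripped-down probe against plain streaming Chat Completions, not our actual agent code.

```python
# Minimal TTFT probe (simplified repro, not our production agent code).
# Assumes OPENAI_API_KEY is set in the environment.
import time

from openai import OpenAI

client = OpenAI()

t0 = time.perf_counter()
stream = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": "Say hello."}],
    stream=True,
)
for chunk in stream:
    # The first chunk usually carries only the role; wait for real content.
    if chunk.choices and chunk.choices[0].delta.content:
        print(f"TTFT: {time.perf_counter() - t0:.2f}s")
        break
```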
My first suspicion would be your functions and the json_schema response format.
If any of these use strict mode (which function tools defined in the right shape seem to enable implicitly), the API has to build a grammar-enforcement artifact from your schema before generation can start.
That artifact should be cached after a long initial first call with a particular schema, but “cached” still means the schema has to be recognized and the artifact retrieved on every subsequent run.
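For illustration, this is the kind of call where that compilation kicks in; the schema below is made up, not anything from your setup.

```python
# Illustrative strict structured-output call. The first request with a
# brand-new schema is the slow one; identical schemas should hit the
# server-side cache on later calls.
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": "Extract the city from: 'I live in Oslo.'"}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "city_extraction",
            "strict": True,  # this flag is what triggers grammar compilation
            "schema": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
                "additionalProperties": False,
            },
        },
    },
)
print(resp.choices[0].message.content)
```

As far as I can tell, the cache is keyed on the schema itself, so if your agents build schemas dynamically (different names, reordered fields), every variant pays the compilation cost again.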
You also have a good number of input tokens there, all of which must be processed as model input before the AI can start generating.
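A quick way to check how much the prompt alone contributes, assuming gpt-4.1 uses the o200k_base encoding like the rest of the recent 4-series:

```python
# Rough prompt-size check; o200k_base is my assumption for gpt-4.1.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")
prompt = "...your full system prompt and context here..."  # placeholder
print(f"{len(enc.encode(prompt))} input tokens to process before generation")
```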
Then there's the fact that Responses, as the place the Agents SDK runs ‘agents’, is a middleman between you and the actual model call, and it has been observed to be systematically slower than Chat Completions for the same call pattern.
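You can measure that gap yourself by sending the same prompt through both endpoints; the event and attribute names below follow the current openai-python SDK as I understand it.

```python
# Side-by-side TTFT check: Responses API vs Chat Completions, same prompt.
import time

from openai import OpenAI

client = OpenAI()
PROMPT = "Say hello."

def ttft_chat() -> float:
    t0 = time.perf_counter()
    stream = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": PROMPT}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            return time.perf_counter() - t0
    return float("nan")

def ttft_responses() -> float:
    t0 = time.perf_counter()
    stream = client.responses.create(model="gpt-4.1", input=PROMPT, stream=True)
    for event in stream:
        if event.type == "response.output_text.delta":
            return time.perf_counter() - t0
    return float("nan")

print(f"chat: {ttft_chat():.2f}s  responses: {ttft_responses():.2f}s")
```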
Paying for “service_tier”: “priority” might cut down on those 100+ second outliers; the impression is that OpenAI bins normal calls (the ones without the roughly doubled expense) into something like “low output rate, throttled on our slowest, most overloaded GPUs”.
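It's just an extra parameter on the call, though note that priority processing is billed at a higher rate and your account has to be eligible for it:

```python
# service_tier is a documented request parameter; "priority" costs more
# per token and requires an eligible account.
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": "Say hello."}],
    service_tier="priority",
)
print(resp.service_tier)  # the response reports the tier actually used
```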
I really need some help figuring out how to fix this. We're already on Tier 4, but our response times are still slow: around 2 seconds to first token on almost every call, and sometimes it stretches to 40+ seconds, which is causing problems for our users.
A few things I’m unsure about:
We use strict=true because our agents need to follow the schema reliably; if we turn it off, the model sometimes ignores the schema. Is there any way to keep strict mode but still reduce the delay? (I've sketched a possible warm-up approach after this list.)
Would caching actually make a big difference after the first call? Right now the slowness happens all the time, not just once.
Since Tier 4 is supposed to give better performance, should we still expect these delays? Or do we need something else like the priority tier or moving away from the Agents SDK?
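On the strict-mode point, would something like this actually help: a one-off warm-up call per schema at deploy time, so the compilation cost is paid before users arrive? (AGENT_SCHEMAS here is a hypothetical stand-in for our real strict schemas.)

```python
# Sketch of what I mean: pay the grammar-compilation cost once at startup.
# AGENT_SCHEMAS is a hypothetical stand-in for our real strict schemas.
from openai import OpenAI

client = OpenAI()

AGENT_SCHEMAS = [
    {
        "name": "demo_agent_output",
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {"answer": {"type": "string"}},
            "required": ["answer"],
            "additionalProperties": False,
        },
    },
]

def warm_schema_cache(schemas: list[dict]) -> None:
    """Issue one cheap request per schema so user traffic never pays the
    first-call compilation penalty."""
    for json_schema in schemas:
        client.chat.completions.create(
            model="gpt-4.1",
            messages=[{"role": "user", "content": "ok"}],
            max_completion_tokens=16,  # keep the warm-up call cheap
            response_format={"type": "json_schema", "json_schema": json_schema},
        )

warm_schema_cache(AGENT_SCHEMAS)
```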
Right now our agents feel too slow for production, so any practical suggestions would really help.