Hi everyone,
I’m seeing a big gap between the expected and actual time to first token when using GPT-4.1 through the OpenAI Agents SDK.
From the docs and community discussions, TTFT should be around 400 ms. But in our case:
- Most requests take 2–3 seconds to the first token
- Many queries take 40+ seconds
- Some go even longer than that
I’m not sure if we’re missing something in our setup or if this is a broader issue. The slow responses are starting to frustrate our users, so I’d really appreciate any guidance.
Has anyone else faced this?
Could it be related to the SDK, configuration, streaming, or something else entirely?
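For reference, this is roughly how we're measuring it: a stripped-down probe against plain streaming Chat Completions, not our actual agent code.

```python
# Minimal TTFT probe (simplified repro, not our production agent code).
# Assumes OPENAI_API_KEY is set in the environment.
import time

from openai import OpenAI

client = OpenAI()

t0 = time.perf_counter()
stream = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": "Say hello."}],
    stream=True,
)
for chunk in stream:
    # The first chunk usually carries only the role; wait for real content.
    if chunk.choices and chunk.choices[0].delta.content:
        print(f"TTFT: {time.perf_counter() - t0:.2f}s")
        break
```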
My first suspicion would be your functions and the json_schema response format.
If any of these use strict mode (which function tools defined in the right shape seem to enable implicitly), the API has to build a grammar-enforcement artifact from your schema before generation can start.
That artifact should be cached after a long initial first call with a particular schema, but “cached” still means the schema has to be recognized and the artifact retrieved on every subsequent run.
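For illustration, this is the kind of call where that compilation kicks in; the schema below is made up, not anything from your setup.

```python
# Illustrative strict structured-output call. The first request with a
# brand-new schema is the slow one; identical schemas should hit the
# server-side cache on later calls.
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": "Extract the city from: 'I live in Oslo.'"}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "city_extraction",
            "strict": True,  # this flag is what triggers grammar compilation
            "schema": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
                "additionalProperties": False,
            },
        },
    },
)
print(resp.choices[0].message.content)
```

As far as I can tell, the cache is keyed on the schema itself, so if your agents build schemas dynamically (different names, reordered fields), every variant pays the compilation cost again.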
You also have a good number of input tokens there, all of which must be processed as model input before the AI can start generating.
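A quick way to check how much the prompt alone contributes, assuming gpt-4.1 uses the o200k_base encoding like the rest of the recent 4-series:

```python
# Rough prompt-size check; o200k_base is my assumption for gpt-4.1.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")
prompt = "...your full system prompt and context here..."  # placeholder
print(f"{len(enc.encode(prompt))} input tokens to process before generation")
```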
Then there's the fact that Responses, as the place the Agents SDK runs ‘agents’, is a middleman between you and the actual model call, and it has been observed to be systematically slower than Chat Completions for the same call pattern.
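You can measure that gap yourself by sending the same prompt through both endpoints; the event and attribute names below follow the current openai-python SDK as I understand it.

```python
# Side-by-side TTFT check: Responses API vs Chat Completions, same prompt.
import time

from openai import OpenAI

client = OpenAI()
PROMPT = "Say hello."

def ttft_chat() -> float:
    t0 = time.perf_counter()
    stream = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": PROMPT}],
        stream=True,
    )
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            return time.perf_counter() - t0
    return float("nan")

def ttft_responses() -> float:
    t0 = time.perf_counter()
    stream = client.responses.create(model="gpt-4.1", input=PROMPT, stream=True)
    for event in stream:
        if event.type == "response.output_text.delta":
            return time.perf_counter() - t0
    return float("nan")

print(f"chat: {ttft_chat():.2f}s  responses: {ttft_responses():.2f}s")
```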
Paying for “service_tier”: “priority” might cut down on those 100+ second outliers; the impression is that OpenAI bins normal calls (the ones without the roughly doubled expense) into something like “low output rate, throttled on our slowest, most overloaded GPUs”.
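It's just an extra parameter on the call, though note that priority processing is billed at a higher rate and your account has to be eligible for it:

```python
# service_tier is a documented request parameter; "priority" costs more
# per token and requires an eligible account.
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": "Say hello."}],
    service_tier="priority",
)
print(resp.service_tier)  # the response reports the tier actually used
```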
I really need some help figuring out how to fix this. We're already on Tier 4, but our response times are still slow: around 2 seconds to first token on almost every call, and sometimes it stretches to 40+ seconds, which is causing problems for our users.
A few things I’m unsure about:
We use strict=true because our agents need to follow the schema reliably; if we turn it off, the model sometimes ignores the schema. Is there any way to keep strict mode but still reduce the delay? (I've sketched a possible warm-up approach after this list.)
Would caching actually make a big difference after the first call? Right now the slowness happens all the time, not just once.
Since Tier 4 is supposed to give better performance, should we still expect these delays? Or do we need something else like the priority tier or moving away from the Agents SDK?
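On the strict-mode point, would something like this actually help: a one-off warm-up call per schema at deploy time, so the compilation cost is paid before users arrive? (AGENT_SCHEMAS here is a hypothetical stand-in for our real strict schemas.)

```python
# Sketch of what I mean: pay the grammar-compilation cost once at startup.
# AGENT_SCHEMAS is a hypothetical stand-in for our real strict schemas.
from openai import OpenAI

client = OpenAI()

AGENT_SCHEMAS = [
    {
        "name": "demo_agent_output",
        "strict": True,
        "schema": {
            "type": "object",
            "properties": {"answer": {"type": "string"}},
            "required": ["answer"],
            "additionalProperties": False,
        },
    },
]

def warm_schema_cache(schemas: list[dict]) -> None:
    """Issue one cheap request per schema so user traffic never pays the
    first-call compilation penalty."""
    for json_schema in schemas:
        client.chat.completions.create(
            model="gpt-4.1",
            messages=[{"role": "user", "content": "ok"}],
            max_completion_tokens=16,  # keep the warm-up call cheap
            response_format={"type": "json_schema", "json_schema": json_schema},
        )

warm_schema_cache(AGENT_SCHEMAS)
```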
Right now our agents feel too slow for production, so any practical suggestions would really help.