If you search the structured outputs or function calling documentation, this is the ONLY mention that appears of the initial latency of setting up a context-free grammar:
Specifically, for fine-tuned models:
- Schemas undergo additional processing on the first request (and are then cached). If your schemas vary from request to request, this may result in higher latencies.
Why would this only be mentioned for fine-tuned models (besides the fact that they are slow to run when cold)?
The CFG construction latency is real and impactful on the fastest model OpenAI’s got, fine-tuning or not.
Shown below: 10 seconds to get “hello” from gpt-4.1-nano with six functions:
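A minimal sketch of how the timing can be reproduced, assuming the standard `openai` Python SDK; the six placeholder tool schemas are illustrative, not the exact functions from that test:

```python
import time
from openai import OpenAI  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Six illustrative strict tool schemas, just to trigger schema/CFG processing.
tools = [
    {
        "type": "function",
        "function": {
            "name": f"lookup_{i}",
            "description": f"Placeholder function {i}.",
            "strict": True,
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
                "additionalProperties": False,
            },
        },
    }
    for i in range(6)
]

def timed_call():
    """Time a single request that carries the strict tool schemas."""
    start = time.perf_counter()
    response = client.chat.completions.create(
        model="gpt-4.1-nano",
        messages=[{"role": "user", "content": "Say hello."}],
        tools=tools,
        max_tokens=16,
    )
    return time.perf_counter() - start, response.choices[0].message.content

# First call: schemas are processed and cached; the large delay shows up here.
print("cold: %.2fs -> %r" % timed_call())
# Second call with identical schemas: should hit the cache and return quickly.
print("warm: %.2fs -> %r" % timed_call())
```

On the first call the strict schemas get compiled and cached; the second, identical call should come back in a fraction of that time.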
The documentation downplays this.
Also not mentioned is the lifetime of the cache behind this structured output enforcement: whether it uses the same per-server-instance hash-based caching just clarified in the documentation for the context window caching discount, or a more persistent mechanism, and what the scope of that persistence is across projects and across organizations.
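Absent documentation, the only way to answer these questions is empirically. A rough probe, reusing `timed_call()` and `tools` from the sketch above; the interval schedule and the `OTHER_PROJECT_KEY` placeholder are assumptions, not anything documented:

```python
import time
from openai import OpenAI

# Probe 1: cache lifetime. Re-send the identical schemas after increasing idle
# periods and watch for the cold-start latency to reappear.
for wait_minutes in (1, 5, 15, 60):
    time.sleep(wait_minutes * 60)
    elapsed, _ = timed_call()  # helper from the sketch above
    print(f"after {wait_minutes} min idle: {elapsed:.2f}s")

# Probe 2: cache scope. OTHER_PROJECT_KEY is a hypothetical key belonging to a
# different project or organization; if its first call with the same tools is
# already fast, the cache is broader than per-project.
other_client = OpenAI(api_key="OTHER_PROJECT_KEY")
start = time.perf_counter()
other_client.chat.completions.create(
    model="gpt-4.1-nano",
    messages=[{"role": "user", "content": "Say hello."}],
    tools=tools,  # identical schemas from the sketch above
    max_tokens=16,
)
print(f"other project, first call: {time.perf_counter() - start:.2f}s")
```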