Documentation issue: Structured Outputs implies NO initial delay except for fine-tuned models

Searching the Structured Outputs and function-calling documentation, this is the ONLY mention of the initial latency of setting up a context-free grammar:

It appears specifically under fine-tuned models:

  • Schemas undergo additional processing on the first request (and are then cached). If your schemas vary from request to request, this may result in higher latencies.

Why would this only be mentioned for fine-tuned models (besides the fact that they are slow to run when cold)?

The CFG-construction latency is real and significant even on the fastest model OpenAI offers, fine-tuned or not.

Shown below: 10 seconds to get “hello” from gpt-4.1-nano with six functions:

The documentation downplays any mention of this.
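For anyone who wants to reproduce this kind of measurement, here is a minimal sketch assuming the openai Python SDK and an OPENAI_API_KEY in the environment; the six strict-mode tool schemas are placeholders, not the actual functions from the screenshot above:

```python
# Times a first call (where any schema/CFG processing would happen) against a
# second, presumably cached, call. Tool names and schemas are placeholders.
import time
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Six small strict-mode function schemas standing in for the six functions
# in the report; change a property name to force a fresh schema if desired.
tools = [
    {
        "type": "function",
        "function": {
            "name": f"lookup_{i}",
            "description": f"Placeholder tool {i}",
            "strict": True,
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
                "additionalProperties": False,
            },
        },
    }
    for i in range(6)
]

def timed_call(label: str) -> None:
    start = time.perf_counter()
    response = client.chat.completions.create(
        model="gpt-4.1-nano",
        messages=[{"role": "user", "content": "Say hello."}],
        tools=tools,
        max_tokens=16,
    )
    elapsed = time.perf_counter() - start
    print(f"{label}: {elapsed:.2f}s -> {response.choices[0].message.content!r}")

timed_call("first call (schema processing)")
timed_call("second call (warm)")
```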


Also not mentioned is the lifetime of the cache for this structured-output enforcement artifact: whether it uses the same server-instance cache-hashing method just clarified in the documentation for the context-window caching discount, or a more persistent mechanism, and what the scope of that persistence is across projects and across organizations.

Just to investigate the persistence of the structured-output artifact for functions, I had saved the exact request pictured above as a Playground preset, simply to run it again. Fifteen hours later, on a weekend morning, it took seven seconds to get four tokens of output:

A follow-up API call took 0.5 seconds.
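To probe how long that cache actually survives, one option is to re-run an identical request on a schedule and log the latency until the cold-start spike reappears. A rough sketch of such a probe (my own assumption about how to test this, not an official method), using a single placeholder function schema rather than the original preset:

```python
# Cache-lifetime probe: run on a schedule (e.g. hourly via cron) with an
# unchanged tool schema and append latency to a CSV; the run where latency
# jumps back up would mark the point where the cached artifact expired.
import csv
import time
from datetime import datetime, timezone
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "lookup",
        "description": "Placeholder tool",
        "strict": True,
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string"}},
            "required": ["query"],
            "additionalProperties": False,
        },
    },
}]

start = time.perf_counter()
client.chat.completions.create(
    model="gpt-4.1-nano",
    messages=[{"role": "user", "content": "Say hello."}],
    tools=tools,
    max_tokens=16,
)
elapsed = time.perf_counter() - start

with open("structured_output_latency.csv", "a", newline="") as f:
    csv.writer(f).writerow([datetime.now(timezone.utc).isoformat(), f"{elapsed:.2f}"])
```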