Structured Output: Caching and Latency

Hi there,

The new structured output feature is excellent! Perfect for a task I have. The docs mention that the first time a schema is used it incurs ~10-60 s of extra latency, and that the schema is thereafter cached so requests run at regular inference speed.

I’m curious about the “scope” of this caching, as it’s very important for us to give our users the fastest inference time possible. Is the schema cached only for that particular conversation (i.e. the scope is the conversation ID), or for the particular API key being used to access the API?

In essence, I’m trying to understand whether every time we send an API request to, say, 4o-mini using a constant schema we incur that latency, or whether it’s only the first time we use that schema with our particular API key.
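For reference, this is roughly the call pattern I mean (a minimal sketch using the OpenAI Python SDK; the schema name and fields are just placeholders):

```python
from openai import OpenAI

client = OpenAI()

# A constant schema reused across every request (fields are placeholders).
TICKET_SCHEMA = {
    "type": "object",
    "properties": {
        "category": {"type": "string"},
        "priority": {"type": "integer"},
    },
    "required": ["category", "priority"],
    "additionalProperties": False,
}

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Classify this support ticket: ..."}],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "ticket_classification",
            "strict": True,
            "schema": TICKET_SCHEMA,
        },
    },
)

print(response.choices[0].message.content)
```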

Cheers
Eric

Hi Eric,

I don’t have any details on the cache duration. I’ll add it to the list of questions we ask the OpenAI team for clarification on, though that won’t happen until next week now.

@ericlaycock44 I tested it out myself using a single API key with multiple calls over several days, and can confirm that it’s a one-off penalty at the very beginning, with subsequent calls being significantly faster.

My understanding is that the schema gets converted to a context-free grammar (CFG), which is then used to constrain the logits during the token sampling/decoding (i.e. text “generation”) step. So presumably that one-off penalty comes from the conversion from JSON/string to a CFG, together with hashing your schema and storing it in a database next to your API key. On subsequent requests the schema is hashed and looked up in the database, skipping the whole CFG conversion step.
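Purely to illustrate that mental model (this is my speculation, not OpenAI’s actual implementation; the function names and cache structure are made up), the lookup could work roughly like this:

```python
import hashlib
import json

# In-memory stand-in for the real datastore; keys are (api_key, schema_hash).
cfg_cache: dict[tuple[str, str], str] = {}

def schema_fingerprint(schema: dict) -> str:
    """Hash a canonical serialization of the JSON schema."""
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def compile_schema_to_cfg(schema: dict) -> str:
    """Placeholder for the expensive schema -> grammar compilation step."""
    return f"<grammar for {schema_fingerprint(schema)}>"

def get_grammar(api_key: str, schema: dict) -> str:
    """Return a cached grammar, or build it (slow) and cache it on first use."""
    key = (api_key, schema_fingerprint(schema))
    if key not in cfg_cache:
        # First use of this schema for this API key: the expensive
        # JSON-schema -> CFG conversion, i.e. the one-off ~10-60 s penalty.
        cfg_cache[key] = compile_schema_to_cfg(schema)
    return cfg_cache[key]
```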

@platypus sweet, thanks for your insights!
