Prompt Caching Hierarchy with Structured Outputs

Summary

Adding the structured output schema as a prefix to the system prompt leads to unintuitive caching results, because users typically put the content they want to cache in their prompts, not in their schema.

Options to consider

  1. Make the structured output schema the last factor considered for prompt caching, rather than the first.
  2. Add an option to enable/disable caching of the structured output schema. My preference would be for it to be disabled by default, with an option to enable it.

Workaround

Until a change is implemented, users can place the content they want to cache at the beginning of the schema. For example:

import OpenAI from "openai";
import { z } from "zod";
import { zodResponseFormat } from "openai/helpers/zod";

const openai = new OpenAI();
const storyAboutPirates = "..."; // the long content you want cached

const completion = await openai.chat.completions.create({
  model: "gpt-4o-mini",
  messages: [
    {
      role: "user",
      content: `Summarize the story in 12 words.`,
    },
  ],
  response_format: zodResponseFormat(
    z.object({
      // The content to cache is smuggled into the schema via the field
      // description, so it sits at the front of the cached prefix.
      contentToCache: z.boolean().describe(storyAboutPirates),
      synopsis: z.string().describe(`Summarize the story in 12 words.`),
    }),
    "synopsis",
  ),
});

I’m not sure I follow the goal of this recommendation and would just like to understand the scenario better… the “schema” simply defines the shape of the output you want back from the model, and unless you’re changing the type of object you want back on a per-request basis, it shouldn’t be changing. If anything, it’s likely to be the least-changing part of your prompt, so it makes perfect sense to me that they cache it first.

There may be scenarios where the schema changes between prompts but those scenarios seem rare to me…

Their goal with caching would be to cache the most static parts of the prompt first, which, when using structured outputs, should be the output schema. The “instances” of those outputs that get returned by the model can also be cached as part of the conversation history, assuming that nothing in your system prompt changes between requests.

For example, you might want to avoid putting the current time in your system message, as this would change on every request. Since the system message is the first message, this would prevent them from caching anything but the schema. If they moved the schema to come after the system message, they wouldn’t even be able to cache that.
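
To make that concrete, here is a minimal sketch of the anti-pattern being described, assuming the schema-first caching order discussed in this thread (model and field names are illustrative):

import OpenAI from "openai";
import { z } from "zod";
import { zodResponseFormat } from "openai/helpers/zod";

const openai = new OpenAI();

// The timestamp makes the system message different on every request, so
// with the schema cached ahead of it, the schema is the only part of the
// prefix that can ever be reused.
const completion = await openai.chat.completions.create({
  model: "gpt-4o-mini",
  messages: [
    {
      role: "system",
      content: `Current time: ${new Date().toISOString()}. You are a scheduling assistant.`,
    },
    { role: "user", content: "Summarize today's schedule." },
  ],
  response_format: zodResponseFormat(
    z.object({ summary: z.string() }),
    "schedule_summary",
  ),
});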


When you have long-form content (e.g. a movie script or financial report) that you need to extract a bunch of information from, you’re unlikely to extract all of the desired structured data in one request, because of OpenAI’s various limits and because the models can start giving poor answers. Therefore, you end up sending multiple requests to review the same document and extract different pieces of information. My sense is that this use case is quite common.
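
As a sketch of that multi-pass pattern (the document variable, field names, and model are placeholders): both requests send the same long document, but because each uses a different schema, and the schema sits at the front of the cached prefix, the repeated document never gets a cache hit.

import OpenAI from "openai";
import { z } from "zod";
import { zodResponseFormat } from "openai/helpers/zod";

const openai = new OpenAI();
const financialReport = "..."; // the same long document is sent in both requests

// Pass 1: extract one set of fields.
const revenue = await openai.chat.completions.create({
  model: "gpt-4o-mini",
  messages: [{ role: "user", content: financialReport }],
  response_format: zodResponseFormat(
    z.object({
      totalRevenue: z.string(),
      revenueByQuarter: z.array(z.string()),
    }),
    "revenue",
  ),
});

// Pass 2: same document, different schema. The schema difference changes
// the very start of the prompt, so the long document is not served from cache.
const risks = await openai.chat.completions.create({
  model: "gpt-4o-mini",
  messages: [{ role: "user", content: financialReport }],
  response_format: zodResponseFormat(
    z.object({ riskFactors: z.array(z.string()) }),
    "risks",
  ),
});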

Possibly an option to enable/disable caching of the structured output schema would support all use cases.


This is very true: the shorter the prompt, the better the results I tend to get. Not sure if it is some sort of bias in the ChatGPT algorithm, but I do this too. I just remembered I haven’t had the time to test the new caching feature yet. Will hopefully do it tomorrow.

I would also love to see a solution for this as well.

In my use case, I have a bunch of images that I am asking various questions about. The structured output response schemas vary with the type of question, but the images stay the same, and I would love to be able to cache them.
For what it’s worth, the workaround suggested above (thanks @dyeoman2) appears to work only with text, not with images.
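
As a rough sketch of that situation (image URL, field names, and model are placeholders): the image is identical across calls, but because each call uses a different schema and the schema leads the cached prefix, the repeated image tokens are never reused from cache.

import OpenAI from "openai";
import { z } from "zod";
import { zodResponseFormat } from "openai/helpers/zod";

const openai = new OpenAI();
const imageUrl = "https://example.com/chart.png"; // the same image in every request

// Question 1 about the image, with its own schema.
const colors = await openai.chat.completions.create({
  model: "gpt-4o-mini",
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "What colors appear in this chart?" },
        { type: "image_url", image_url: { url: imageUrl } },
      ],
    },
  ],
  response_format: zodResponseFormat(
    z.object({ colors: z.array(z.string()) }),
    "colors",
  ),
});

// Question 2 about the same image, with a different schema, so the
// repeated image does not benefit from caching.
const trend = await openai.chat.completions.create({
  model: "gpt-4o-mini",
  messages: [
    {
      role: "user",
      content: [
        { type: "text", text: "Describe the overall trend shown in this chart." },
        { type: "image_url", image_url: { url: imageUrl } },
      ],
    },
  ],
  response_format: zodResponseFormat(
    z.object({ trend: z.string() }),
    "trend",
  ),
});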

I find the current caching behavior to be optimal, because my schemas tend to be more static/repeated than the system prompt or user prompt. Changing the order as proposed here would therefore lead to fewer/shorter cache hits for my use case.

I encountered the exact same issue with structured outputs breaking prompt caching! The schema being prepended to the prompt means any schema change invalidates the cache, which is really frustrating when you want to use different output structures at different conversation stages.

Just posting my alternative workaround here in case anyone else Googling reaches this post (as I did).

My Workaround

I found a solution that preserves caching while still using structured outputs throughout a multi-turn conversation: define a single comprehensive schema upfront that includes ALL the fields you’ll need across all conversation turns, then instruct the model to fill only specific fields at each step while leaving others null/empty.

How It Works

Instead of changing the schema at each step (which breaks caching), you:

  1. Create one schema with all possible output fields you’ll need
  2. At each conversation turn, ask the model to populate only the relevant fields
  3. The schema stays constant, so it gets cached after the first request
  4. Subsequent requests benefit from caching the schema + previous conversation context
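
A minimal sketch of the pattern, under the assumption that fields are declared nullable so unused ones can come back as null (the schema, prompts, and model are illustrative, not my exact test script):

import OpenAI from "openai";
import { z } from "zod";
import { zodResponseFormat } from "openai/helpers/zod";

const openai = new OpenAI();
const longStory = "..."; // the long content you want cached across turns

// One schema covering every field needed across all turns. Fields that a
// given turn doesn't ask for come back as null, so the schema never changes.
const ComprehensiveOutput = z.object({
  synopsis: z.string().nullable(),
  characters: z.array(z.string()).nullable(),
  themes: z.array(z.string()).nullable(),
});
const responseFormat = zodResponseFormat(ComprehensiveOutput, "analysis");

const messages: OpenAI.Chat.Completions.ChatCompletionMessageParam[] = [
  { role: "system", content: "Fill in only the fields you are asked for; set every other field to null." },
  { role: "user", content: longStory },
];

// Turn 1: ask for the synopsis only.
messages.push({ role: "user", content: "Fill in only the synopsis field." });
const turn1 = await openai.chat.completions.create({
  model: "gpt-4o-mini",
  messages,
  response_format: responseFormat,
});

// Turn 2: same schema, new instruction. The schema plus the earlier
// conversation form an unchanged prefix, so (once the prompt is long enough
// to qualify for caching) it can be served from cache.
messages.push({ role: "assistant", content: turn1.choices[0].message.content });
messages.push({ role: "user", content: "Now fill in only the characters field." });
const turn2 = await openai.chat.completions.create({
  model: "gpt-4o-mini",
  messages,
  response_format: responseFormat,
});

// cached_tokens on the later turns is how you can confirm the cache is being hit.
console.log(turn2.usage?.prompt_tokens_details?.cached_tokens);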

Test Script

I created a test script that demonstrates this approach and validates the caching behavior. It makes three API calls with the same schema but different instructions, and reports on token usage and cache hits:

I can’t include the exact link here, but if I tell you my GitHub username is skwirrel and the gist ID is eacc00f410aa82e7b065edf79b76e4b0, you should be able to find it!

The test script shows that caching is being used on the subsequent turns, and when I tried it, ChatGPT filled out only the relevant part of the schema for each turn.

P.S. I created a modified version of my test script that used a different schema for each turn of the conversation and, as expected, this breaks the caching: zero cached tokens are used. So this is definitely still an issue as of December 2025.