Prompt Caching Hierarchy with Structured Outputs

Summary

Adding the structured output schema as a prefix to the system prompt leads to unintuitive caching results, because users typically put the content they want cached in their prompts, not in their schema.
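
For context, the kind of request this affects typically looks something like the sketch below: the long, static content lives in the messages, while the schema is small. (Illustrative only; `longReferenceDocument` is a placeholder, not something from the original post.)

import OpenAI from "openai";
import { z } from "zod";
import { zodResponseFormat } from "openai/helpers/zod";

const openai = new OpenAI();

// Placeholder for the long, static content the user expects to be cached.
const longReferenceDocument = "...";

const completion = await openai.chat.completions.create({
  model: "gpt-4o-mini",
  messages: [
    // The bulk of the prompt lives here, so this is what users expect caching to key on.
    { role: "system", content: longReferenceDocument },
    { role: "user", content: "Summarize the story in 12 words." },
  ],
  // The schema is small, yet it is placed at the front of the cached prefix.
  response_format: zodResponseFormat(
    z.object({ synopsis: z.string() }),
    "synopsis",
  ),
});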

Options to consider

  1. Make the structured output schema the last factor considered for prompt caching, rather than the first.
  2. Add an option to enable/disable caching of the structured output schema. My preference would be to have it disabled by default, with an option to enable it.

Workaround

Until a change is implemented, users can place the content they want cached at the beginning of the schema, for example in the first field's description:

import OpenAI from "openai";
import { z } from "zod";
import { zodResponseFormat } from "openai/helpers/zod";

const openai = new OpenAI();

// Placeholder for the long content to cache (here, a story the model should summarize).
const storyAboutPirates = "...";

const completion = await openai.chat.completions.create({
  model: "gpt-4o-mini",
  messages: [
    {
      role: "user",
      content: `Summarize the story in 12 words.`,
    },
  ],
  response_format: zodResponseFormat(
    z.object({
      // Putting the long content in the first field's description places it at the
      // front of the schema, which is the part the cache keys on first.
      contentToCache: z.boolean().describe(storyAboutPirates),
      synopsis: z.string().describe(`Summarize the story in 12 words.`),
    }),
    "synopsis",
  ),
});

I’m not sure I follow the goal of this recommendation and would just like to understand the scenario better… the “schema” simply defines the shape of the output you want back from the model, and unless you’re changing the type of object you want back on a per-request basis, it shouldn’t be changing. If anything, it’s likely to be the least-changing part of your prompt, so it makes perfect sense to me that they cache it first.

There may be scenarios where the schema changes between prompts, but those scenarios seem rare to me…

Their goal with caching would be to cache the most static parts of the prompt first, which should be the output schema when using structured outputs. The “instances” of those outputs that get returned by the model can also be cached as part of the conversation history, assuming that nothing in your system prompt changes between requests.

For example, you might want to avoid putting the current time in your system message, as this would change on every request. Since the system message is the first message, this would prevent them from caching anything but the schema. If they moved the schema to come after the system message, they wouldn’t even be able to cache that.
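
As a rough sketch of that idea (hypothetical instructions and schema, not the poster's code), a volatile value such as the current time can be passed in the user message so the schema and system prefix stay identical across requests:

import OpenAI from "openai";
import { z } from "zod";
import { zodResponseFormat } from "openai/helpers/zod";

const openai = new OpenAI();

// Volatile value kept out of the system message so the cached prefix stays stable.
const now = new Date().toISOString();

const completion = await openai.chat.completions.create({
  model: "gpt-4o-mini",
  messages: [
    // Static instructions only; these can be reused verbatim across requests.
    { role: "system", content: "You are a scheduling assistant." },
    // The changing value goes last, where it does not invalidate the earlier prefix.
    { role: "user", content: `The current time is ${now}. What should I do next?` },
  ],
  response_format: zodResponseFormat(
    z.object({ nextAction: z.string() }),
    "next_action",
  ),
});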


When you have long-form content (e.g., a movie script or a financial report) that you need to extract a bunch of information from, you’re unlikely to extract all of the desired structured data in one request, both because of OpenAI’s various limits and because the models can start giving poor answers. Therefore, you end up sending multiple requests to review the same document and extract different pieces of information. My sense is that this use case is quite common.
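
A minimal sketch of that multi-pass pattern, assuming hypothetical schemas and a placeholder document: the same long document is resent on every request while only the small schema changes, which is exactly the part the current hierarchy caches first.

import OpenAI from "openai";
import { z } from "zod";
import { zodResponseFormat } from "openai/helpers/zod";

const openai = new OpenAI();

// Hypothetical long document that stays identical across requests.
const financialReport = "...";

// Each pass extracts a different slice of structured data from the same document.
const passes = [
  { name: "revenue", schema: z.object({ totalRevenue: z.number() }) },
  { name: "risks", schema: z.object({ riskFactors: z.array(z.string()) }) },
];

for (const pass of passes) {
  const completion = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [
      // The repeated, expensive-to-resend part is the document, not the schema.
      { role: "system", content: financialReport },
      { role: "user", content: `Extract the ${pass.name} information.` },
    ],
    // The schema differs per pass, so a schema-first cache prefix misses here.
    response_format: zodResponseFormat(pass.schema, pass.name),
  });
  console.log(completion.choices[0].message.content);
}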

Possibly, an option to enable/disable caching of the structured output schema would support all use cases.

This is very true; the shorter the prompt, the better the results I tend to get. Not sure if it is some sort of bias in the ChatGPT algorithm, but I do this too. I just remembered that I haven’t had time to test the new caching feature. Will hopefully do it tomorrow.

I would love to see a solution for this as well.

In my use case, I have a bunch of images that I am asking various questions about. The structured output response schemas vary with the type of question, but the images stay the same, and I would love to be able to cache them.
For what it’s worth, the workaround suggested above (thanks @dyeoman2) appears to work only with text, not with images.
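
For what that use case might look like in practice (illustrative only; the image URL, question, and schema are placeholders), the image is the repeated part of the request while the schema varies per question:

import OpenAI from "openai";
import { z } from "zod";
import { zodResponseFormat } from "openai/helpers/zod";

const openai = new OpenAI();

// The same image is sent with every question; only the question and schema change.
const imageUrl = "https://example.com/photo.jpg";

const completion = await openai.chat.completions.create({
  model: "gpt-4o-mini",
  messages: [
    {
      role: "user",
      content: [
        { type: "image_url", image_url: { url: imageUrl } },
        { type: "text", text: "How many people are in this photo?" },
      ],
    },
  ],
  // A different schema would be used for other questions about the same image.
  response_format: zodResponseFormat(
    z.object({ peopleCount: z.number() }),
    "people_count",
  ),
});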

I find the current caching behavior to be optimal, because my schemas tend to be more static/repeated than the system prompt or user prompt. Changing the order as proposed here would therefore lead to fewer/shorter cache hits for my use case.