Automatic context window caching - better performance? When?

**Analysis of Context Window Caching Performance in gpt-4o**

You get a token discount if you submit an input to particular models that is identical from the start to a recent request, within a limited expiry window. Do you get other benefits?

This report analyzes benchmark results comparing gpt-4o performance with context window caching explicitly encouraged versus deliberately defeated using nonces. The benchmark used a sizeable input context (approx. 1850 tokens), designed to meet OpenAI’s criteria for activating the caching mechanism. Each condition involved 100 trials submitted uniformly over 10 seconds, measuring latency, throughput rates, total completion time, and system fingerprints.
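
For concreteness, one trial can be timed roughly as in the sketch below, using the openai Python SDK with streaming. The metric definitions (stream rate, total rate) and field handling are my reconstruction of how such numbers could be produced, not a dump of the actual harness.

```python
import time
from openai import OpenAI

client = OpenAI()

def run_trial(messages, model="gpt-4o"):
    """Stream one completion; record latency, total time, rates, and fingerprint."""
    t0 = time.monotonic()
    first_token_at = None
    fingerprint = None
    usage = None
    chunks = []

    stream = client.chat.completions.create(
        model=model,
        messages=messages,
        stream=True,
        stream_options={"include_usage": True},  # final chunk carries token usage
    )
    for chunk in stream:
        if chunk.system_fingerprint:
            fingerprint = chunk.system_fingerprint
        if chunk.choices and chunk.choices[0].delta.content:
            if first_token_at is None:
                first_token_at = time.monotonic()
            chunks.append(chunk.choices[0].delta.content)
        if chunk.usage:  # only present on the last chunk when include_usage is set
            usage = chunk.usage

    t1 = time.monotonic()
    completion_tokens = usage.completion_tokens if usage else len(chunks)
    latency_s = (first_token_at or t1) - t0   # time to first content token
    total_time_s = t1 - t0
    return {
        "latency_s": latency_s,
        "total_time_s": total_time_s,
        # Rate definitions below are my guess at how the report's metrics were computed.
        "stream_rate": completion_tokens / max(total_time_s - latency_s, 1e-9),
        "total_rate": completion_tokens / total_time_s,
        "system_fingerprint": fingerprint,
        "preview": "".join(chunks)[:60],
        "cached_tokens": getattr(
            getattr(usage, "prompt_tokens_details", None), "cached_tokens", None
        ),
    }
```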

The cache was “primed” by an initial discarded call, followed by a three-second hold-off to allow propagation.
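
The cached condition then amounts to something like the following sketch, assuming the `run_trial` helper above and a fixed `messages` list; the three-second pause and 10-second spread mirror the procedure described here, not any documented requirement.

```python
import time
from concurrent.futures import ThreadPoolExecutor

TRIALS = 100
SPREAD_S = 10.0    # submit the batch uniformly over 10 seconds
HOLD_OFF_S = 3.0   # wait after priming so the cache can propagate

def run_cached_batch(messages):
    run_trial(messages)        # priming call; result discarded
    time.sleep(HOLD_OFF_S)

    with ThreadPoolExecutor(max_workers=TRIALS) as pool:
        futures = []
        for _ in range(TRIALS):
            futures.append(pool.submit(run_trial, messages))
            time.sleep(SPREAD_S / TRIALS)   # uniform pacing: one request every 0.1 s
        return [f.result() for f in futures]
```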

Summary of Statistically Distinguishable Findings:

With a sufficiently large input context, enabling context window caching demonstrates a clear and statistically significant performance improvement in terms of overall throughput and completion time. While average latency shows a slight reduction, the most substantial gains are observed in the average total time and processing rates.

This contrasts with a low-context case of a single turn under 100 tokens, where an 11% increase in completion time for the output task was seen when the cache-breaking technique of a unique nonce within the first five tokens was not also used. Peculiar. (A sketch of this noncing appears after the bullet below.)

  • Consistent with both observations, cache-promoting input correlates with reduced diversity in system fingerprints, even when the input is below the >1024-token threshold required to receive a discount.
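
For reference, the cache-breaking “noncing” amounts to making the very start of the input unique per request, roughly like this (a UUID here stands in for whatever nonce format you prefer):

```python
import uuid

def with_nonce(messages):
    """Prepend a unique marker so no two requests share a cacheable prefix."""
    nonce = uuid.uuid4().hex[:8]
    unique_msgs = [dict(m) for m in messages]
    # Placing the nonce within the first few tokens makes the prefix differ
    # from the very start, which is what prefix caching keys on.
    unique_msgs[0]["content"] = f"[{nonce}] " + unique_msgs[0]["content"]
    return unique_msgs
```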

The key statistically distinguishable findings are summarized below:

| Metric | Caching Enabled (Primed) | Caching Disabled (Nonces) | Observation |
| --- | --- | --- | --- |
| Average Latency (latency_s) | 0.606 | 0.678 | Slight Reduction with Caching |
| Average Stream Rate | 59.5 | 50.4 | Statistically Significant Increase (~18%) |
| Average Total Rate | 46.9 | 39.3 | Statistically Significant Increase (~19%) |
| Average Total Time (total_time_s) | 2.789 | 3.617 | Statistically Significant Reduction (~23%) |
| Maximum Total Time (total_time_s) | 5.953 | 10.407 | Substantial Reduction with Caching |
| Fingerprint Diversity | Lower (2 types) | Higher (4 types) | Qualitative Reduction Observed |
| Response Preview Variety | Highly Consistent | More Varied | A result of noncing |
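
The table does not name the test behind the “statistically significant” labels. If you want to check figures like these yourself from the per-trial records, a Welch’s t-test is one reasonable choice (my choice for illustration, not necessarily what was used here):

```python
from scipy import stats

def compare(metric, cached_results, nonced_results):
    """Two-sided Welch's t-test on one per-trial metric across the two conditions."""
    a = [r[metric] for r in cached_results]
    b = [r[metric] for r in nonced_results]
    res = stats.ttest_ind(a, b, equal_var=False)   # Welch: unequal variances allowed
    return res.statistic, res.pvalue

# e.g. compare("total_time_s", cached, nonced) -> a small p-value suggests a real difference
```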

Detailed Observations and Interpretation:

  • Throughput and Completion Time: The most significant impact of caching is observed in the throughput metrics (Stream Rate, Total Rate) and the Average Total Time. Enabling caching resulted in approximately 18-19% higher average processing rates and a substantial ~23% reduction in the average time required to complete the generation task. This strongly indicates that for large, repeated contexts, caching effectively speeds up the overall processing. The reduction in maximum total time further suggests caching helps mitigate instances of very long processing durations, or could reflect avoiding low-performing instances that cannot serve from the cache.

  • Latency Performance: While average latency (latency_s) was slightly lower with caching enabled, the difference was less pronounced than the improvements seen in total time and rates. This suggests that the primary benefit of caching in this scenario lies in accelerating token generation after the first token, rather than solely minimizing the initial delay (see the sketch after this list).

  • System Fingerprint and Output Diversity: As observed in previous tests (even those potentially below the caching threshold), enabling caching correlates with reduced diversity in both system fingerprints and the initial content of the generated responses. The cached runs produced highly consistent response previews and were dominated by fewer distinct fingerprints compared to the non-cached runs.
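
To make the “after the first token” point concrete, per-trial records like those produced by `run_trial` above can be split into first-token wait versus generation time, and fingerprint diversity is just a tally. The split below is my decomposition for illustration, not an API-reported figure.

```python
from collections import Counter
from statistics import mean

def summarize(results):
    """Split total time into first-token wait vs. generation, and tally fingerprints."""
    gen_times = [r["total_time_s"] - r["latency_s"] for r in results]
    return {
        "avg_latency_s": mean(r["latency_s"] for r in results),
        "avg_generation_s": mean(gen_times),   # where most of the caching gain shows up
        "avg_total_time_s": mean(r["total_time_s"] for r in results),
        "max_total_time_s": max(r["total_time_s"] for r in results),
        "fingerprints": Counter(r["system_fingerprint"] for r in results),
        "preview_variety": len({r["preview"] for r in results}),
    }
```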

Discussion: Caching Benefits with Large Contexts

These results, obtained using an input context large enough to reliably trigger caching, demonstrate a clear performance advantage for context window caching in gpt-4o. The significant reduction in average total completion time and the corresponding increase in throughput rates confirm that caching mechanisms can substantially improve efficiency when processing large, repetitive input contexts.

I did not simulate significantly longer inputs of the kind typical of chat, nor did I wait for the entirety of the 30-paragraph kitten essay I consistently solicited. The trials were run at 10:00 UTC (2 am Pacific), a relatively quiet time, and completed in under 20 seconds total, except for one maximum “cached” outlier taking 10 s longer in a batch not included in this table.

Conclusion: Context Window Caching Offers Throughput Gains for Cache-Activating Contexts

When the input context size meets the necessary criteria (e.g., >1024 tokens, or more now that OpenAI injects file_search system messages that later disappear), context window caching in gpt-4o provides substantial performance benefits, primarily as increased throughput and thus reduced overall completion times. Latency sees only a minor improvement; the main advantage lies in accelerating the bulk of the token generation process, not merely the “loading” of the prompt. Below the discount threshold, the mechanism still appears to be in play, potentially penalizing small inputs.
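
If you want to check whether a prompt clears that activation threshold before counting on the discount, a rough local token count is enough. The sketch below uses tiktoken’s o200k_base (the gpt-4o encoding); note that server-injected content such as the file_search messages mentioned above will not appear in a local count.

```python
import tiktoken

def prompt_tokens(messages):
    """Rough local count of the prompt tokens OpenAI will see in the message text."""
    enc = tiktoken.get_encoding("o200k_base")   # encoding used by gpt-4o
    # Ignores the few per-message wrapper tokens; close enough for a threshold check.
    return sum(len(enc.encode(m["content"])) for m in messages)

# e.g. prompt_tokens(messages) > 1024 suggests the request can participate in caching
```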

Do you have an API workload that you think would behave differently still?