Caching with multimodal seems to be missing in Docs?

vincentkoc · November 10, 2025, 9:02am

Can i get some clarity on how multimodal use-cases from audio, video, images etc are affected with cache?

sps · November 10, 2025, 9:42am

Welcome to the dev community, @vincentkoc!

Caching works for model inputs. Thus, multimodal input content like images and audio are tokenized and cached as well, along with text tokens. Upon a cache hit, the number of cached tokens hit will be billed at the cached input price, while the rest are billed at regular pricing.

Quoting from docs:

Images: Images included in user messages, either as links or as base64-encoded data, as well as multiple images, can be sent. Ensure the detail parameter is set identically, as it impacts image tokenization.

It’s important to note that caching also depends on the model.

e.g. gpt-audio does not support cached inputs, while gpt-realtime does. You can get model-specific details on the models page.

Topic		Replies	Views
Cached Input Tokens in Chat Completions API chatgpt , api , chat-completion , token , cache	1	2685	April 30, 2025
GPT-4o Mini Caching Pricing Issue API	1	279	March 11, 2025
Cached input audio_tokens is always 0 API audio , realtime	3	540	November 8, 2024
Clarification on Token Usage for Image Inputs API api	4	261	September 10, 2025
How does Prompt Caching work? Prompting api , prompt-caching	8	8122	October 29, 2024

Caching with multimodal seems to be missing in Docs?

Related topics