Can i get some clarity on how multimodal use-cases from audio, video, images etc are affected with cache?
1 Like
Welcome to the dev community, @vincentkoc!
Caching works for model inputs. Thus, multimodal input content like images and audio are tokenized and cached as well, along with text tokens. Upon a cache hit, the number of cached tokens hit will be billed at the cached input price, while the rest are billed at regular pricing.
Quoting from docs:
Images: Images included in user messages, either as links or as base64-encoded data, as well as multiple images, can be sent. Ensure the detail parameter is set identically, as it impacts image tokenization.
It’s important to note that caching also depends on the model.
e.g. gpt-audio does not support cached inputs, while gpt-realtime does. You can get model-specific details on the models page.
3 Likes
