I am performing a VQA task using GPT-4o Mini.
The only element that should be cached is the system prompt, so I expected minimal caching to occur.
However, I noticed an unusually high number of cached elements.
Looking at the screenshot, GPT-4o Mini is clearly caching far more than just the system prompt.
What’s the issue here? The costs have skyrocketed.
Caching is a mechanism that saves you money on API inputs when requests repeat a largely identical input prefix. The $7 billed as cached input would have been $14 if caching didn't exist.
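A rough sketch of that math, assuming the usual 50% discount on cached input tokens for gpt-4o / gpt-4o-mini (the dollar figure is just the one from your screenshot):

```python
# Illustration of cached-input billing (assumes a 50% cached-input discount).
cached_input_billed = 7.00   # dashboard spend shown as cached input
discount = 0.5               # cached tokens billed at half the normal input rate
uncached_equivalent = cached_input_billed / discount
print(f"Without caching, that input would have cost ${uncached_equivalent:.2f}")
# -> Without caching, that input would have cost $14.00
```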
What is remarkable here is the ratio of your input cost, cached or not, to how much output is actually being generated.
You say “vision question-answering.” A reminder: GPT-4o-mini charges twice as much for image input as GPT-4o, the opposite of what you'd normally expect. So you cannot assume that a large number of images (such as frames extracted from a video) will be as cheap on mini as text tokens are; the cost is actually higher.
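A back-of-the-envelope comparison, using assumed per-million-token prices and the ~33.33x image-token multiplier mentioned below; these are illustrative numbers, so check the current pricing page before relying on them:

```python
# Illustrative only: assumed input prices and an example image token count.
PRICE_4O_MINI = 0.15 / 1_000_000   # assumed gpt-4o-mini text input price per token
PRICE_4O = 2.50 / 1_000_000        # assumed gpt-4o text input price per token
IMAGE_TOKENS_4O = 765              # example: a 1024x1024 high-detail image on gpt-4o
MINI_IMAGE_MULTIPLIER = 33.33      # mini multiplies image tokens before billing

cost_mini = IMAGE_TOKENS_4O * MINI_IMAGE_MULTIPLIER * PRICE_4O_MINI
cost_4o = IMAGE_TOKENS_4O * PRICE_4O
print(f"gpt-4o-mini: ${cost_mini:.4f}/image, gpt-4o: ${cost_4o:.4f}/image")
# With these assumed prices, mini works out to roughly 2x the per-image cost of gpt-4o.
```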
I suspect that if you record the usage object returned by each API call, which breaks down token usage by type, you'll quickly discover what is costing you and where caching is providing a benefit. Logging every API call through one wrapper function along with the inputs sent, or having AI analyze your code, may also reveal a programming error. OpenAI did have an issue about a month ago where the 33.33x token multiplier for mini input images wasn't being billed…that has since been fixed.
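A minimal sketch of what that logging could look like, using the usage object from the Chat Completions API (the wrapper function name is just an example; adapt it to however you're calling the API):

```python
from openai import OpenAI

client = OpenAI()

def ask_with_usage_logging(messages, model="gpt-4o-mini"):
    """Send a chat completion and log the per-request token usage breakdown."""
    response = client.chat.completions.create(model=model, messages=messages)
    usage = response.usage
    # prompt_tokens_details.cached_tokens shows how much of the input hit the cache
    cached = getattr(usage.prompt_tokens_details, "cached_tokens", 0) or 0
    print(
        f"input={usage.prompt_tokens} "
        f"(cached={cached}, uncached={usage.prompt_tokens - cached}) "
        f"output={usage.completion_tokens} total={usage.total_tokens}"
    )
    return response
```

Run every request through something like this for a while and the per-call breakdown will show whether the spend is dominated by image input tokens, and how much of it is being discounted by the cache.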