To fulfill a task, my system breaks inference down into separate pieces. This lets me tackle the task in parallel, with specialized prompts for specialized agents that collaborate on the result. So my inference isn't the typical "conversation" where user/assistant turns accumulate over time.
Some of the prompts are long: they range from 500 to 13,000 tokens, average around 4,000 tokens, and a typical "long" prompt is about 8,000 input tokens. Task-specific variables are injected into the prompt, typically starting around 50-60% of the way in, so the leading portion of the prompt is identical across calls. This means I should technically be able to benefit from caching for roughly 50% of my input tokens.
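For context, this is roughly how one of these prompts is assembled (the names and contents below are placeholders, not my actual prompts): the static instructions come first and the task variables are appended at the end, so repeated calls share the same leading tokens.

```python
# Hypothetical sketch of the prompt layout: a fixed instruction prefix
# (a few thousand tokens in practice) followed by the injected task variables.
STATIC_AGENT_INSTRUCTIONS = "You are a specialized agent...\n"  # fixed text, identical on every call

def build_prompt(task_variables: dict) -> str:
    # Static prefix first, task-specific variables last, so the first
    # ~50-60% of input tokens is the same from call to call.
    variable_block = "\n".join(f"{key}: {value}" for key, value in task_variables.items())
    return f"{STATIC_AGENT_INSTRUCTIONS}\n## Task context\n{variable_block}"
```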
After experimenting, I noticed that CompletionUsage.prompt_tokens_details.cached_tokens is always 0. My assumption is that OpenAI only keeps one cached prompt, meaning that if you alternate between two different prompts, even repeatedly, you never benefit from caching. Is this understanding correct?
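For reference, here is roughly how I'm reading that field (the model name and prompt content are placeholders):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

long_prompt = "...one of the long agent prompts: static prefix + task variables..."

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": long_prompt}],
)

usage = response.usage
# Even when the exact same long prefix was sent moments earlier,
# cached_tokens always comes back as 0.
print(usage.prompt_tokens, usage.prompt_tokens_details.cached_tokens)
```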
If my understanding is correct, what are possible workarounds?
One thing I was considering is splitting my project across several API keys (one key per prompt) to see whether the cache is scoped per key (we currently use a single key and track inference accounting with a separate analytics platform). But I suspect keys are only used by OpenAI for more granular reporting and rate limiting.
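If I do run that experiment, it would look something like this (the environment variable names and agent names are hypothetical): one client per prompt family, each on its own key.

```python
import os
from openai import OpenAI

# One client per prompt "family", each with its own API key (hypothetical
# environment variable names), to test whether the cache is per key.
clients = {
    "planner": OpenAI(api_key=os.environ["OPENAI_KEY_PLANNER"]),
    "executor": OpenAI(api_key=os.environ["OPENAI_KEY_EXECUTOR"]),
}

def run_agent(agent_name: str, prompt: str):
    return clients[agent_name].chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
    )
```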