Hi,
I’m trying to get my fine-tuned model on the platform to use prompt caching, but even though I have a long, static system prompt (around 1,200 tokens), requests never hit the cache and I pay the full input price unnecessarily. I’m keeping my request rate under 15 RPM, since that’s what it recommends, but even with 10 seconds between each query, caching still doesn’t work.
And the documentation is really vague: it mentions a prompt_cache_key parameter to increase the chances of a cache hit, but I can’t find anywhere to pass that argument when I’m using structured outputs parsed with Pydantic via the Responses API (see the sketch below).
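To make this concrete, here is roughly what my calls look like. The schema, model ID, and prompt text are placeholders, but the structure (static instructions, Pydantic parsing through the Responses API) matches what I’m running:

```python
from openai import OpenAI
from pydantic import BaseModel

client = OpenAI()

# Placeholder schema, just to illustrate the structured output I'm parsing
class TicketSummary(BaseModel):
    title: str
    priority: str

# Static system prompt, ~1,200 tokens, identical on every request
SYSTEM_PROMPT = "You are a support assistant. ..."

response = client.responses.parse(
    model="ft:gpt-4o-2024-08-06:my-org::abc123",  # placeholder fine-tuned model ID
    instructions=SYSTEM_PROMPT,
    input="Summarise this ticket: ...",
    text_format=TicketSummary,
    # prompt_cache_key="static-prefix-v1",  # this is where I'd expect to pass it,
    #                                       # but parse() doesn't seem to accept it
)

print(response.output_parsed)
print(response.usage)  # cached input tokens stay at 0 with the fine-tuned model
```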
Is the documentation intentionally vague? I ran across multiple threads, but none of them had a conclusive answer, so I wanted to open a new topic.
Interestingly, if I use the regular gpt-4o instead of my fine-tuned model, caching works correctly. Is caching simply not available for fine-tuned models? If so, why doesn’t the documentation say so anywhere? (Usage snippet below shows how I’m checking.)
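For reference, this is how I’m checking whether a request hit the cache (field names are from the Responses API usage object, as far as I can tell):

```python
# Inspect the usage block returned by the Responses API
usage = response.usage
print(usage.input_tokens, usage.input_tokens_details.cached_tokens)
# with model="gpt-4o" cached_tokens climbs after the first request;
# with my fine-tuned model it stays at 0
```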
Thanks for your help!