I wanted to confirm cached_tokens in prompt_tokens_details, but it always returns None. My fixed system prompt is more than 2,000 tokens, so I think it should be cached, but I could not confirm it.
I am using async_client.beta.chat.completions.parse with structured output to get the response.
The model is gpt-4o-2024-08-06 on Azure OpenAI.
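For reference, this is roughly how I call it and read the usage; the endpoint, key, and the Answer schema below are placeholders, not my real values:

```python
import asyncio

from openai import AsyncAzureOpenAI
from pydantic import BaseModel

# Placeholder credentials and endpoint, for illustration only.
client = AsyncAzureOpenAI(
    azure_endpoint="https://YOUR-RESOURCE.openai.azure.com",
    api_key="YOUR_KEY",
    api_version="2024-10-01-preview",
)


class Answer(BaseModel):
    summary: str


# Stand-in for a fixed system prompt of 2,000+ tokens.
LONG_SYSTEM_PROMPT = "You are a meticulous assistant. " * 400


async def main() -> None:
    response = await client.beta.chat.completions.parse(
        model="gpt-4o-2024-08-06",  # Azure deployment name
        messages=[
            {"role": "system", "content": LONG_SYSTEM_PROMPT},
            {"role": "user", "content": "Summarize the policy."},
        ],
        response_format=Answer,
    )
    usage = response.usage
    # prompt_tokens_details (and therefore cached_tokens) may be None
    # if the deployment or API version does not report caching.
    details = getattr(usage, "prompt_tokens_details", None)
    print("prompt_tokens:", usage.prompt_tokens)
    print("cached_tokens:", getattr(details, "cached_tokens", None))


asyncio.run(main())
```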
I rather suspect that with Azure you get a snapshot frozen at the time of deployment, so arbitrary changes to the AI model can't damage your application.
Thus redeployment, after researching which regions and deployment types offer the prompt-caching feature in the model availability grid, should be the next avenue for you to pursue to get caching working on your API calls. Caching kicks in when calls are repeated within a short time window with nothing changing in the first 1024+ tokens of the prompt; see the sketch below.
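Once you have a deployment in a region that supports caching, a minimal probe like this (a sketch with placeholder endpoint and deployment names, assuming the repeated static prefix exceeds 1024 tokens) should show cached_tokens rise above zero on the second call:

```python
import asyncio

from openai import AsyncAzureOpenAI

# Placeholder credentials and endpoint, for illustration only.
client = AsyncAzureOpenAI(
    azure_endpoint="https://YOUR-RESOURCE.openai.azure.com",
    api_key="YOUR_KEY",
    api_version="2024-10-01-preview",
)

# A static prefix well over 1024 tokens so the cacheable threshold is met.
STATIC_PREFIX = "You are a meticulous assistant. " * 300


async def probe() -> None:
    for i in range(2):
        resp = await client.chat.completions.create(
            model="gpt-4o-2024-08-06",  # your deployment name
            messages=[
                {"role": "system", "content": STATIC_PREFIX},
                {"role": "user", "content": f"Call {i}: reply with OK."},
            ],
        )
        details = resp.usage.prompt_tokens_details
        cached = getattr(details, "cached_tokens", None) if details else None
        # Expect 0 (or None) on the first call and > 0 on the second,
        # provided both calls land within the cache's short time window.
        print(f"call {i}: prompt={resp.usage.prompt_tokens} cached={cached}")


asyncio.run(probe())
```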