Will the newly announced prompt caching work with my Assistants API calls? Currently I'm spending around 3k tokens per run, and most of that comes from the same extensive instructions being re-sent every time to make sure the responses are usable.
I'd also appreciate clarity on this. I similarly use the same long prompt at the start of threads, so it would be really helpful from a cost standpoint if these were cached.
Hello, welcome both of you.
It's not entirely clear to me either, but from what I can tell from the wording in the docs, there is support for Assistants interactions.
https://platform.openai.com/docs/guides/prompt-caching/what-can-be-cached
And if Assistants aren't supported yet, they certainly will be eventually.
My experience so far (2024-10-29) is that caching does not work with the Assistants API. The `prompt_tokens_details` field is missing from the `usage` object returned by the run endpoint. By contrast, it is provided on the chat completions endpoint even when the cached token count is 0.
I imagine it is only a matter of time before it is implemented; the underlying thread objects from the Assistants API seem to offer even fancier caching potential.
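For anyone who wants to check this themselves, here's a minimal sketch on the chat completions side using the official Python SDK (the model name and prompt are just placeholders, and I'm assuming an SDK version recent enough to expose `prompt_tokens_details`):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Caching only kicks in above roughly 1024 prompt tokens, so repeat a long
# static instruction block to make the prefix cacheable.
long_instructions = "You are a helpful assistant. Always answer concisely. " * 200

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model
    messages=[
        {"role": "system", "content": long_instructions},
        {"role": "user", "content": "Summarize your instructions in one line."},
    ],
)

# On chat completions, prompt_tokens_details is present even when
# cached_tokens is 0; on an Assistants run it currently isn't returned.
usage = response.usage.model_dump()
details = usage.get("prompt_tokens_details") or {}
print("prompt tokens:", usage.get("prompt_tokens"))
print("cached tokens:", details.get("cached_tokens", 0))
```

Running it twice in a row with the same system prompt should show `cached_tokens` > 0 on the second call if caching applied.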
As of the date of this reply, the Assistants API does return the cached tokens, but under `prompt_token_details` instead of `prompt_tokens_details` (note the missing s in the word "token").
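In case it's useful, here's a small sketch of how I'd read the cached token count from a run defensively, so it works whichever spelling the API returns (the thread and assistant IDs are placeholders, and `create_and_poll` is from the Python SDK's beta Assistants helpers):

```python
from openai import OpenAI

client = OpenAI()

# Placeholder IDs for an existing assistant and thread that carry a long,
# repeated instruction prefix.
run = client.beta.threads.runs.create_and_poll(
    thread_id="thread_abc123",
    assistant_id="asst_abc123",
)

# usage is only populated once the run reaches a terminal state.
usage = run.usage.model_dump() if run.usage else {}

# The Assistants API has been observed returning "prompt_token_details"
# (no s), while chat completions uses "prompt_tokens_details" -- check both.
details = (
    usage.get("prompt_token_details")
    or usage.get("prompt_tokens_details")
    or {}
)
print("prompt tokens:", usage.get("prompt_tokens"))
print("cached tokens:", details.get("cached_tokens", 0))
```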