Cached input audio_tokens is always 0

I’m using the Realtime API (based on the GitHub console example) and, as the title says, none of my input audio tokens are being cached. Is there anything I need to do on my end to get this working as expected? I would have guessed that cache identification happens entirely on the server side.

My project ID is proj_gVgeRdz2IgsyNgRukZ5IPOvs if anyone from OpenAI would like to take a closer look. Short conversations are ballooning in cost because the audio accumulates across turns and misses the cache, which makes production deployment unreasonable.

We (on this forum) have already established that output audio tokens are fed back in as input tokens on each new turn, which means that at least some audio tokens should be getting cached.
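To make that accounting concrete, here is a minimal sketch of the per-turn bookkeeping one could do from the usage block of each response.done event. The field names match the payload further down; the “expected cacheable” figure is my own estimate (the prior turn’s input plus output audio), not anything the API reports:

// Per-turn token bookkeeping from response.done usage (sketch).
// "Expected cacheable" is an assumption: the previous turn's input and
// output audio should reappear verbatim as input context on this turn.
interface Usage {
  input_token_details: {
    text_tokens: number;
    audio_tokens: number;
    cached_tokens: number;
    cached_tokens_details: { text_tokens: number; audio_tokens: number };
  };
  output_token_details: { text_tokens: number; audio_tokens: number };
}

let priorAudioContext = 0; // audio tokens carried over from earlier turns

function onResponseDone(usage: Usage): void {
  const inAudio = usage.input_token_details.audio_tokens;
  const cachedAudio = usage.input_token_details.cached_tokens_details.audio_tokens;
  console.log(
    `audio in=${inAudio}, cached=${cachedAudio}, ` +
      `expected cacheable ≈ ${priorAudioContext}`
  );
  // Everything heard and spoken this turn becomes next turn's context.
  priorAudioContext = inAudio + usage.output_token_details.audio_tokens;
}

On the payload below, that estimate would say a large share of the 1213 input audio tokens ought to be cacheable, yet cached audio sits at 0 even while 384 text tokens hit the cache.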

From looking at the latest commits (and the code in general) in the reference client OpenAI published on GitHub, I can tell they haven’t added anything related to prompt caching.

Can you clarify how exactly you came to the conclusion that nothing is being cached?

Yep, I’m manually inspecting all events in a relay server; here’s the raw payload of a response.done event:

{
  "type": "response.done",
  "event_id": "event_ARJvLp4nEMT6iuuBqPmVJ",
  "response": {
    "object": "realtime.response",
    "id": "resp_ARJvFDphUQJE6mgYdaK4G",
    "status": "completed",
    "status_details": null,
    "output": [
      {
        "id": "item_ARJvFNjqac1js9tdtVNiU",
        "object": "realtime.item",
        "type": "message",
        "status": "completed",
        "role": "assistant",
        "content": [
          {
            "type": "audio",
            "transcript": "..."
          }
        ]
      }
    ],
    "usage": {
      "total_tokens": 2622,
      "input_tokens": 1911,
      "output_tokens": 711,
      "input_token_details": {
        "text_tokens": 698,
        "audio_tokens": 1213,
        "cached_tokens": 384,
        "cached_tokens_details": {
          "text_tokens": 384,
          "audio_tokens": 0
        }
      },
      "output_token_details": {
        "text_tokens": 116,
        "audio_tokens": 595
      }
    }
  }
}
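For context, the inspection itself is nothing fancy; roughly the following, using the ws package. The URL, model name, and headers are the standard Realtime ones from the console/relay examples, not anything cache-specific:

// Relay-side inspection (sketch): forward everything, but parse and log
// usage whenever a response.done event comes through.
import WebSocket from "ws";

const upstream = new WebSocket(
  "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview",
  {
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      "OpenAI-Beta": "realtime=v1",
    },
  }
);

upstream.on("message", (data: WebSocket.RawData) => {
  const event = JSON.parse(data.toString());
  if (event.type === "response.done") {
    // This is where the payload above comes from.
    console.log(JSON.stringify(event.response.usage, null, 2));
  }
  // ...then forward `data` to the browser client unchanged.
});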

I’d take you up on that offer “if you have any session IDs to debug”:

Remember: the cache times out quickly. You might be there listening to the AI blather for five minutes, and the cache can expire at the low end of its expected lifetime.
