[Realtime API] Why does `input_audio_transcription` usage include `text_tokens`?

Summary

I’m using the Realtime API in conversation mode (not transcription-only mode) with `input_audio_transcription` enabled to get a parallel transcription of the user’s audio input. The transcription model is `gpt-4o-mini-transcribe`.

I notice that the `conversation.item.input_audio_transcription.completed` event always includes `text_tokens: 1` in the usage breakdown, even though I’m only sending audio data (no text messages).

I’d like to understand what these text tokens represent and whether this is expected behavior.

Configuration

My Realtime API session configuration (via session.update):

{
  "type": "session.update",
  "session": {
    "model": "gpt-realtime-2025-08-28",
    "modalities": ["text"],
    "input_audio_transcription": {
      "model": "gpt-4o-mini-transcribe",
      "language": "ja"
      // Note: No "prompt" parameter specified (empty/default)
    },
    "turn_detection": {
      "type": "server_vad"
    }
  }
}

The Realtime session is used for conversation (function calling), and input_audio_transcription provides parallel transcription for logging purposes.

Important: I’m not providing a prompt parameter in the transcription config, so the text tokens cannot be from a custom prompt.

Observed Usage Data

Every transcription event shows a consistent pattern:

{
  "type": "conversation.item.input_audio_transcription.completed",
  "transcript": "七丸。",
  "usage": {
    "type": "tokens",
    "total_tokens": 16,
    "input_tokens": 11,
    "input_token_details": {
      "text_tokens": 1,      // ← Always 1 (even with no prompt!)
      "audio_tokens": 10
    },
    "output_tokens": 5
  }
}

Observation: `text_tokens: 1` appears consistently across all transcription events:

  • regardless of audio content (different utterances)
  • regardless of transcription length (output_tokens vary, but input text_tokens is always 1)
  • even though no custom prompt is provided
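
For anyone reproducing this: I pull the breakdown out of the event with a small helper like the one below. The helper name is mine; the field names match the event payload above.

```python
def split_transcription_usage(usage: dict) -> dict:
    """Flatten the usage object of a
    conversation.item.input_audio_transcription.completed event
    into a single-level dict for logging."""
    details = usage.get("input_token_details", {})
    return {
        "text_tokens": details.get("text_tokens", 0),    # mysteriously always 1
        "audio_tokens": details.get("audio_tokens", 0),
        "output_tokens": usage.get("output_tokens", 0),
        "total_tokens": usage.get("total_tokens", 0),
    }
```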

Questions

  1. What do these text_tokens represent?

    • Is it the language parameter ("ja") being tokenized?
    • Is it internal request metadata (item_id, session_id, etc.)?
    • Is it a fixed overhead for the transcription API call?
    • Note: Since I’m not providing a prompt parameter, it cannot be from a custom prompt.
  2. Is this documented anywhere?

    • I couldn’t find official documentation explaining why text_tokens appear in transcription usage.
  3. Should I expect this value to always be 1?

    • Is this a constant regardless of configuration (language, empty prompt)?
    • Would adding a prompt parameter increase this value?
    • Does it vary based on the length of the language parameter or other metadata?
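
What I’ve done so far to check point 3 empirically is collect the value across a session’s events, roughly like this (the helper name is mine):

```python
def text_token_values(events: list[dict]) -> set[int]:
    """Collect the distinct input text_tokens counts seen across
    transcription-completed events; in my sessions this is always {1}."""
    return {
        ev["usage"]["input_token_details"]["text_tokens"]
        for ev in events
        if ev.get("type") == "conversation.item.input_audio_transcription.completed"
    }
```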

Context

In the Realtime API, I receive two separate usage reports:

  1. response.done events with usage for the main conversation (Realtime model)
  2. conversation.item.input_audio_transcription.completed events with usage for transcription (gpt-4o-mini-transcribe)

Understanding what the text_tokens in the transcription usage represent helps me:

  • Calculate accurate transcription costs separately from conversation costs
  • Explain billing breakdowns to stakeholders
  • Better understand the Realtime API’s token accounting model
  • Determine if these tokens are fixed overhead or variable based on configuration

Any insights would be greatly appreciated! 🙏