[Realtime API] Why does `input_audio_transcription` usage include `text_tokens`?

Summary

I’m using the Realtime API in conversation mode (not transcription-only mode) with `input_audio_transcription` enabled to get a parallel transcription of the user’s audio input. The transcription model is `gpt-4o-mini-transcribe`.

I notice that the `conversation.item.input_audio_transcription.completed` event always includes `text_tokens: 1` in the usage breakdown, even though I’m only sending audio data (no text messages).

I’d like to understand what these text tokens represent and whether this is expected behavior.

Configuration

My Realtime API session configuration (via session.update):

{
  "type": "session.update",
  "session": {
    "model": "gpt-realtime-2025-08-28",
    "modalities": ["text"],
    "input_audio_transcription": {
      "model": "gpt-4o-mini-transcribe",
      "language": "ja"
      // Note: No "prompt" parameter specified (empty/default)
    },
    "turn_detection": {
      "type": "server_vad"
    }
  }
}

The Realtime session is used for conversation (function calling), and input_audio_transcription provides parallel transcription for logging purposes.

Important: I’m not providing a prompt parameter in the transcription config, so the text tokens cannot be from a custom prompt.

Observed Usage Data

Every transcription event shows a consistent pattern:

{
  "type": "conversation.item.input_audio_transcription.completed",
  "transcript": "七丸。",
  "usage": {
    "type": "tokens",
    "total_tokens": 16,
    "input_tokens": 11,
    "input_token_details": {
      "text_tokens": 1,      // ← Always 1 (even with no prompt!)
      "audio_tokens": 10
    },
    "output_tokens": 5
  }
}

Observation: `text_tokens: 1` appears consistently across all transcription events:

  • regardless of audio content (different utterances)
  • regardless of transcription length (output_tokens vary, but input text_tokens is always 1)
  • even though no custom prompt is provided
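
For anyone reproducing this: I pull the breakdown out of the event with a small helper like the one below. The helper name is mine; the field names match the event payload above.

```python
def split_transcription_usage(usage: dict) -> dict:
    """Flatten the usage object of a
    conversation.item.input_audio_transcription.completed event
    into a single-level dict for logging."""
    details = usage.get("input_token_details", {})
    return {
        "text_tokens": details.get("text_tokens", 0),    # mysteriously always 1
        "audio_tokens": details.get("audio_tokens", 0),
        "output_tokens": usage.get("output_tokens", 0),
        "total_tokens": usage.get("total_tokens", 0),
    }
```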

Questions

  1. What do these text_tokens represent?

    • Is it the language parameter ("ja") being tokenized?
    • Is it internal request metadata (item_id, session_id, etc.)?
    • Is it a fixed overhead for the transcription API call?
    • Note: Since I’m not providing a prompt parameter, it cannot be from a custom prompt.
  2. Is this documented anywhere?

    • I couldn’t find official documentation explaining why text_tokens appear in transcription usage.
  3. Should I expect this value to always be 1?

    • Is this a constant regardless of configuration (language, empty prompt)?
    • Would adding a prompt parameter increase this value?
    • Does it vary based on the length of the language parameter or other metadata?
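
What I’ve done so far to check point 3 empirically is collect the value across a session’s events, roughly like this (the helper name is mine):

```python
def text_token_values(events: list[dict]) -> set[int]:
    """Collect the distinct input text_tokens counts seen across
    transcription-completed events; in my sessions this is always {1}."""
    return {
        ev["usage"]["input_token_details"]["text_tokens"]
        for ev in events
        if ev.get("type") == "conversation.item.input_audio_transcription.completed"
    }
```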

Context

In the Realtime API, I receive two separate usage reports:

  1. response.done events with usage for the main conversation (Realtime model)
  2. conversation.item.input_audio_transcription.completed events with usage for transcription (gpt-4o-mini-transcribe)

Understanding what the text_tokens in the transcription usage represent helps me:

  • Calculate accurate transcription costs separately from conversation costs
  • Explain billing breakdowns to stakeholders
  • Better understand the Realtime API’s token accounting model
  • Determine if these tokens are fixed overhead or variable based on configuration

Any insights would be greatly appreciated! 🙏