Summary
I’m using the Realtime API in conversation mode (not transcription-only mode) with `input_audio_transcription` enabled to get a parallel transcription of the user’s audio input. The transcription model is `gpt-4o-mini-transcribe`.
I’ve noticed that the `conversation.item.input_audio_transcription.completed` event always includes `text_tokens: 1` in the usage breakdown, even though I’m only sending audio data (no text messages).
I’d like to understand what these text tokens represent and whether this is expected behavior.
Configuration
My Realtime API session configuration (sent via `session.update`):

```json
{
  "type": "session.update",
  "session": {
    "model": "gpt-realtime-2025-08-28",
    "modalities": ["text"],
    "input_audio_transcription": {
      "model": "gpt-4o-mini-transcribe",
      "language": "ja"
      // Note: no "prompt" parameter specified (empty/default)
    },
    "turn_detection": {
      "type": "server_vad"
    }
  }
}
```
The Realtime session is used for conversation (function calling), and `input_audio_transcription` provides a parallel transcription for logging purposes.
Important: I’m not providing a `prompt` parameter in the transcription config, so the text tokens cannot come from a custom prompt.
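For reference, here is a minimal sketch of how I send this configuration, using Python and the `websockets` library. The endpoint URL and headers are my reading of the Realtime docs, and the header keyword argument name varies by library version, so treat the connection details as assumptions:

```python
# Minimal sketch: open a Realtime WebSocket and send the session.update above.
# Endpoint and headers are assumptions based on the Realtime docs.
import asyncio
import json
import os

import websockets  # pip install websockets

SESSION_UPDATE = {
    "type": "session.update",
    "session": {
        "model": "gpt-realtime-2025-08-28",
        "modalities": ["text"],
        "input_audio_transcription": {
            "model": "gpt-4o-mini-transcribe",
            "language": "ja",
            # No "prompt" key: transcription runs with the default prompt.
        },
        "turn_detection": {"type": "server_vad"},
    },
}

async def main() -> None:
    url = "wss://api.openai.com/v1/realtime?model=gpt-realtime-2025-08-28"
    headers = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",
    }
    # `additional_headers` in websockets >= 14; older versions use `extra_headers`.
    async with websockets.connect(url, additional_headers=headers) as ws:
        await ws.send(json.dumps(SESSION_UPDATE))
        # ... append audio buffers, commit, and handle server events here ...

asyncio.run(main())
```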
Observed Usage Data
Every transcription event shows a consistent pattern:
```json
{
  "type": "conversation.item.input_audio_transcription.completed",
  "transcript": "七丸。",
  "usage": {
    "type": "tokens",
    "total_tokens": 16,
    "input_tokens": 11,
    "input_token_details": {
      "text_tokens": 1,  // ← always 1 (even with no prompt!)
      "audio_tokens": 10
    },
    "output_tokens": 5
  }
}
```
Observation: `text_tokens: 1` appears in every transcription event (see the tally sketch below):
- It is independent of the audio content (different utterances)
- It is independent of the transcript length (`output_tokens` varies, but the input `text_tokens` is always 1)
- No custom prompt is ever provided, so a prompt cannot explain it
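To rule out a counting mistake on my side, I tally the breakdown across a whole session with a small handler like this (a sketch; it assumes `ws` is the open Realtime WebSocket from the configuration sketch above):

```python
# Sketch: tally input_token_details across every transcription event.
# Assumes `ws` is an open Realtime WebSocket (see configuration sketch).
import json
from collections import Counter

text_token_counts = Counter()  # text_tokens value -> number of events

async def tally_transcription_usage(ws) -> None:
    async for raw in ws:
        event = json.loads(raw)
        if event.get("type") != "conversation.item.input_audio_transcription.completed":
            continue
        usage = event["usage"]
        details = usage["input_token_details"]
        text_token_counts[details["text_tokens"]] += 1
        print(
            f"transcript={event['transcript']!r} "
            f"text={details['text_tokens']} "
            f"audio={details['audio_tokens']} "
            f"out={usage['output_tokens']}"
        )

# In my sessions, text_token_counts always ends up as Counter({1: <n events>}):
# every single event reports exactly one input text token.
```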
Questions
- What do these `text_tokens` represent?
  - Is it the `language` parameter ("ja") being tokenized?
  - Is it internal request metadata (item_id, session_id, etc.)?
  - Is it a fixed overhead for the transcription API call?
  - Note: since I’m not providing a `prompt` parameter, it cannot come from a custom prompt.
- Is this documented anywhere?
  - I couldn’t find official documentation explaining why `text_tokens` appear in transcription usage.
- Should I expect this value to always be 1?
  - Is it constant regardless of configuration (language, empty prompt)?
  - Would adding a `prompt` parameter increase this value?
  - Does it vary with the length of the `language` parameter or other metadata?
Context
In the Realtime API, I receive two separate usage reports:
- `response.done` events carry usage for the main conversation (the Realtime model)
- `conversation.item.input_audio_transcription.completed` events carry usage for transcription (`gpt-4o-mini-transcribe`)
Understanding what the `text_tokens` in the transcription usage represent would help me:
- Calculate transcription costs accurately, separately from conversation costs (see the sketch below)
- Explain billing breakdowns to stakeholders
- Better understand the Realtime API’s token accounting model
- Determine whether these tokens are fixed overhead or vary with configuration
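For the cost calculation, I currently split the breakdown like this (a sketch; the per-million-token rates below are placeholders, not actual pricing — substitute current values from the pricing page):

```python
# Sketch: price one transcription event from its usage breakdown.
# Rates are PLACEHOLDERS (USD per 1M tokens), not real pricing.
from dataclasses import dataclass

@dataclass(frozen=True)
class Rates:
    text_in: float   # per 1M input text tokens
    audio_in: float  # per 1M input audio tokens
    text_out: float  # per 1M output text tokens

def transcription_cost(usage: dict, rates: Rates) -> float:
    """Cost of one input_audio_transcription.completed event, in USD."""
    details = usage["input_token_details"]
    return (
        details["text_tokens"] * rates.text_in
        + details["audio_tokens"] * rates.audio_in
        + usage["output_tokens"] * rates.text_out
    ) / 1_000_000

# The usage block from "Observed Usage Data" above, with placeholder rates:
usage = {
    "input_token_details": {"text_tokens": 1, "audio_tokens": 10},
    "output_tokens": 5,
}
print(transcription_cost(usage, Rates(text_in=1.0, audio_in=3.0, text_out=5.0)))
```

At one stray text token per event the cost impact is negligible, but I’d still like to know whether it is fixed overhead so the accounting is correct.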
Any insights would be greatly appreciated!