Why are there text tokens in Realtime API

As mentioned here:

"One of the most powerful features of the Realtime API is voice-to-voice interaction with the model, without an intermediate text-to-speech or speech-to-text step."

Why then am I not allowed to set only audio modality in session.update event?

{
    "type": "session.update",
    "session": {
        "modalities": ["audio"]
    }
}

I instead have to set it to ["audio", "text"], and I am then charged for text tokens: the usage details in the response.done event show non-zero text token counts. But I don't need any text tokens, and if the model is an end-to-end audio-to-audio model, shouldn't the text token counts be zero?
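For concreteness, this is roughly how I read those counts from the event. The payload below is a hypothetical example shaped like the usage object I see in response.done (field names and numbers are illustrative, not a real response):

```python
import json

# Hypothetical response.done event, shaped like the usage details
# I observe; the exact field names are an assumption on my part.
event = json.loads("""
{
    "type": "response.done",
    "response": {
        "usage": {
            "total_tokens": 120,
            "input_tokens": 80,
            "output_tokens": 40,
            "input_token_details": {"text_tokens": 20, "audio_tokens": 60},
            "output_token_details": {"text_tokens": 10, "audio_tokens": 30}
        }
    }
}
""")

usage = event["response"]["usage"]
text_tokens = (usage["input_token_details"]["text_tokens"]
               + usage["output_token_details"]["text_tokens"])
print(text_tokens)  # non-zero, even though I only want audio in and audio out
```

Even in an audio-only conversation, both the input and output token details report text tokens, which is what prompts my question.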

As a follow-up: when I do get text tokens for the model response, how are they computed? Are they an intermediate representation produced before the output audio, or are they generated by running a transcription model on the output audio?