Why are there text tokens in Realtime API

As mentioned here:

"One of the most powerful features of the Realtime API is voice-to-voice interaction with the model, without an intermediate text-to-speech or speech-to-text step."

Why then am I not allowed to set only audio modality in session.update event?

{
    "type": "session.update",
    "session": {
        "modalities": ["audio"]
    }
}

I instead have to set it to ["audio", "text"], and I am then charged for text tokens: the usage details in the response.done event show non-zero text token counts. But I don't need any text tokens, and if the model is an end-to-end audio-to-audio model, shouldn't the text token counts be zero?
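For concreteness, this is roughly how I read those counts from the event. The payload below is a hypothetical example shaped like the usage object I see in response.done (field names and numbers are illustrative, not a real response):

```python
import json

# Hypothetical response.done event, shaped like the usage details
# I observe; the exact field names are an assumption on my part.
event = json.loads("""
{
    "type": "response.done",
    "response": {
        "usage": {
            "total_tokens": 120,
            "input_tokens": 80,
            "output_tokens": 40,
            "input_token_details": {"text_tokens": 20, "audio_tokens": 60},
            "output_token_details": {"text_tokens": 10, "audio_tokens": 30}
        }
    }
}
""")

usage = event["response"]["usage"]
text_tokens = (usage["input_token_details"]["text_tokens"]
               + usage["output_token_details"]["text_tokens"])
print(text_tokens)  # non-zero, even though I only want audio in and audio out
```

Even in an audio-only conversation, both the input and output token details report text tokens, which is what prompts my question.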

As a follow-up: when I do get text tokens for the model response, how are they computed? Are they an intermediate representation produced before the output audio, or are they generated by running a transcription model on the output audio?