As mentioned here,
"One of the most powerful features of the Realtime API is voice-to-voice interaction with the model, without an intermediate text-to-speech or speech-to-text step."
Why then am I not allowed to set only the audio modality in the session.update event?
{
  "type": "session.update",
  "session": {
    "modalities": ["audio"]
  }
}
I instead have to set it to ["audio", "text"], and I am being charged for text tokens: the usage details in the response.done event show non-zero text token counts. But I don’t need any text tokens. And if the model is an end-to-end audio-to-audio model, then the text token counts should be zero, right?
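
To make the question concrete, here is a minimal Python sketch of how the text and audio token counts can be pulled out of a response.done event. It assumes the usage object sits under response.usage and splits counts into input_token_details and output_token_details, each with text_tokens and audio_tokens fields; that shape should be checked against the actual event payload. The helper name token_breakdown and the numbers in the sample event are hypothetical and only illustrate the parsing.

```python
import json


def token_breakdown(event: dict) -> dict:
    """Extract text/audio token counts from a response.done event.

    Assumes the usage payload lives at event["response"]["usage"] with
    "input_token_details" and "output_token_details" sub-objects.
    Missing fields are treated as zero.
    """
    usage = event.get("response", {}).get("usage") or {}
    inp = usage.get("input_token_details") or {}
    out = usage.get("output_token_details") or {}
    return {
        "input_text": inp.get("text_tokens", 0),
        "input_audio": inp.get("audio_tokens", 0),
        "output_text": out.get("text_tokens", 0),
        "output_audio": out.get("audio_tokens", 0),
    }


# Hypothetical event: the numbers below are made up for illustration only.
sample_event = json.loads("""
{
  "type": "response.done",
  "response": {
    "usage": {
      "input_token_details": {"text_tokens": 120, "audio_tokens": 300},
      "output_token_details": {"text_tokens": 45, "audio_tokens": 210}
    }
  }
}
""")

counts = token_breakdown(sample_event)
print(counts)
```

With a helper like this, the non-zero input_text and output_text values are the text token counts being asked about.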