Realtime API input audio tokens increase even when only text is entered.

I have contacted OpenAI support to ask whether this is intended behavior, but have received no response.
I specify audio and text as modalities, and the input transcript is null.
The output is audio and text, while the input is text only.

Nevertheless, the number of input audio tokens increases. The input text tokens increase too, of course.

I asked them about what appears to be an obviously fraudulent charge, and they have no answer. Does anyone know anything about this?

I started a topic on this very issue earlier.

This is expected behavior. On each new turn, the model consumes the audio tokens it output earlier as input, because they are part of the conversation history.
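To make the accounting concrete, here is a rough sketch of how input audio tokens can grow even though the user only ever sends text. It is a toy model, not the Realtime API itself: the `Turn` class, the `input_usage_per_turn` function, and all token counts are made up for illustration, and it assumes every response re-reads the whole history while ignoring instructions, caching, and truncation.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    user_text_tokens: int    # tokens in the user's text message (hypothetical)
    reply_text_tokens: int   # text tokens the model generated (hypothetical)
    reply_audio_tokens: int  # audio tokens the model generated (hypothetical)


def input_usage_per_turn(turns):
    """Return (input_text_tokens, input_audio_tokens) for each response.

    Assumption: every response consumes the full conversation history,
    so earlier replies (text and audio) count as input on later turns.
    """
    history_text = 0
    history_audio = 0
    usage = []
    for turn in turns:
        # The new user message is text only.
        history_text += turn.user_text_tokens
        usage.append((history_text, history_audio))
        # The reply joins the history only after it has been generated.
        history_text += turn.reply_text_tokens
        history_audio += turn.reply_audio_tokens
    return usage


turns = [Turn(10, 20, 100), Turn(12, 25, 150), Turn(8, 15, 90)]
usage = input_usage_per_turn(turns)
for number, (text_in, audio_in) in enumerate(usage, start=1):
    print(f"turn {number}: input text={text_in}, input audio={audio_in}")
```

In this toy run the first response has no input audio tokens, but the second and third do, because the earlier audio replies are now part of the history. That matches the pattern described in the question: text-only input, yet a growing input audio count.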


Thanks, @ivan-luchkin-u, that is useful information.