Realtime API input audio tokens increase even if text is entered.

kazuya.hou · November 14, 2024, 10:23pm

I have contacted OpenAI support to see if this is a specification, but have received no response.
Specify audio and text as modalities, and the input transcript is null.
And the output is audio and text, while the input is only text.

Nevertheless, the number of audio tokens in the input increases.And text, of course.

I asked them what appears to be an obviously fraudulent claim, and they, have no answer.Does anyone know anything about this?

ivan-luchkin-u · November 14, 2024, 10:36pm

I have started a topic on this very issue earlier

This is expected behavior. The model needs to consume output audio tokens because they are a part of the conversation history.

kazuya.hou · November 14, 2024, 10:55pm

Thanks, @ivan-luchkin-u useful information.

Topic		Replies	Views
Realtime API re-consuming it's own output audio as input audio API audio , realtime , api-realtime , api-realtime-speech	10	1333	January 10, 2025
Why are there text tokens in Realtime API API api-realtime	1	156	April 22, 2025
Why does each new request in Realtime API get more expensive? Are tokens accumulating? API realtime , api-realtime	1	242	September 5, 2025
Realtime API pricing questions: text input and audio tokens API realtime	7	475	December 6, 2025
[Realtime API] Why does `input_audio_transcription` usage include `text_tokens`? API api	0	74	October 14, 2025

Realtime API input audio tokens increase even if text is entered.

Related topics