Realtime API re-consuming its own output audio as input audio

I have built my own client which does not send any audio to the API, but the “audio” modality is still enabled so that the model can respond with audio. I have also implemented tracking of all the tokens (input/output, cached, etc.) so that I can calculate the cost.
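
For reference, this is roughly the kind of tracking I mean. It's just a sketch for a raw WebSocket client; the field names assume the usage object shape on the response.done event (input_token_details / output_token_details with text_tokens, audio_tokens, cached_tokens):

```ts
// Running totals across the whole session; updated from each response.done event.
const totals = {
  inputText: 0,
  inputAudio: 0,
  cachedInput: 0,
  outputText: 0,
  outputAudio: 0,
};

// Call this from your WebSocket "message" handler with the raw event payload.
function trackUsage(raw: string): void {
  const event = JSON.parse(raw);
  if (event.type !== "response.done") return;

  const usage = event.response?.usage;
  if (!usage) return;

  totals.inputText += usage.input_token_details?.text_tokens ?? 0;
  totals.inputAudio += usage.input_token_details?.audio_tokens ?? 0; // non-zero from the 2nd response on
  totals.cachedInput += usage.input_token_details?.cached_tokens ?? 0;
  totals.outputText += usage.output_token_details?.text_tokens ?? 0;
  totals.outputAudio += usage.output_token_details?.audio_tokens ?? 0;

  console.log("running token totals:", totals);
}
```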

To my surprise, I have found that audio_tokens are not zero starting from the second model response, which I assume can only mean that the model is being fed the audio tokens it generated previously.

Can someone please elaborate on this behavior? Is it intended? I understand that the model has to keep track of the conversation history on each turn, but this behavior is very counter-intuitive, especially given how large a cost factor audio is with the way this API is priced.

This is also reproducible in the playground when just inputting text using the keyboard button.

Yes, that’s indeed what happens. Assuming the API generates both text and audio in each turn and you send only text, the conversation is usually like this:

---- Turn 1 ----
User: input_text1
Assistant: output_text1 + output_audio1
---- Turn 2 ----
User: input_text2 + (input_text1 + output_text1 + output_audio1)
Assistant: output_text2 + output_audio2

And so on.
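
To make the cost implication concrete, here's a rough illustration of how the input audio tokens accumulate under that pattern (the per-response token figure is an arbitrary assumption, purely for illustration):

```ts
// Illustration only: if every previous assistant audio output is re-sent as
// input on each later turn, per-turn input audio grows linearly and the total
// grows quadratically. The 500-token figure is an arbitrary assumption.
const audioTokensPerResponse = 500;
const turns = 10;

let totalInputAudio = 0;
for (let turn = 1; turn <= turns; turn++) {
  // On turn N, the (N - 1) earlier audio outputs come back in as input audio.
  const inputAudioThisTurn = (turn - 1) * audioTokensPerResponse;
  totalInputAudio += inputAudioThisTurn;
  console.log(`turn ${turn}: input audio tokens ≈ ${inputAudioThisTurn}`);
}
console.log(`cumulative input audio tokens ≈ ${totalInputAudio}`);
```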

They released prompt caching a couple of days ago to reduce this cumulative cost, which is a welcome update, but it’s still too expensive for my use case.

The only thing that looks a bit odd to me is that the output_text1 (which corresponds to the response audio transcript) also seems to be fed to the model each turn (although this is hard to confirm just by looking at the usage returned in each response.done event). What I’m saying is: if the output audio is already being passed in each turn to maintain context, is it necessary to pass the transcript as well? I hope an OpenAI dev can provide an answer to this.

So I’m speculating a bit here, but my “educated” guess is that they need the transcript tokens because that’s the only thing they can bill you for… they bill based on tokens, but what’s the token count for a gunshot sound (used as an example)?

The transcript text gives them something they can bill against. They also want you to use the transcript text when restarting old conversations and for moderation, so the transcript text is playing multiple roles, which adds to the confusion…

Billing is separate from transcription and the actual realtime audio output.
There are different tokens being used.
You can even turn off transcription and still be billed for audio output.
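
For what it’s worth, this is roughly what I mean by turning transcription off. A sketch only, assuming the session.update client event and its input_audio_transcription field; the audio output tokens are billed either way:

```ts
// Sketch: turn off input audio transcription on the session. Audio output is
// still generated and billed; only the transcript of the user's audio is skipped.
// Assumes `ws` is a connected WebSocket to the Realtime API.
function disableInputTranscription(ws: { send: (data: string) => void }): void {
  ws.send(JSON.stringify({
    type: "session.update",
    session: {
      // null disables it; something like { model: "whisper-1" } enables it
      input_audio_transcription: null,
    },
  }));
}
```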

As for @ivan-luchkin-u:
You are correct. It is because they have to store the conversation history, as this is the only technical way to do it.

If you do not want this, you can start a fresh session (a new session.create) for every response you send.
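
Roughly the idea, as a sketch only: open a new connection for each exchange so no server-side audio history carries over. The URL, model name and headers below are placeholders; adjust to whatever you actually use.

```ts
import WebSocket from "ws";

// Sketch of the "fresh session per turn" idea: each turn gets its own
// connection, so nothing from previous turns is re-fed as input.
function runSingleTurn(userText: string): void {
  const ws = new WebSocket(
    "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview",
    { headers: { Authorization: `Bearer ${process.env.OPENAI_API_KEY}`, "OpenAI-Beta": "realtime=v1" } },
  );

  ws.on("open", () => {
    // Only the current user input exists in this session's history.
    ws.send(JSON.stringify({
      type: "conversation.item.create",
      item: { type: "message", role: "user", content: [{ type: "input_text", text: userText }] },
    }));
    ws.send(JSON.stringify({ type: "response.create" }));
  });

  ws.on("message", (data) => {
    const event = JSON.parse(data.toString());
    // ...play back / store the response events here...
    if (event.type === "response.done") ws.close(); // history is dropped with the connection
  });
}
```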

Hope this helps. :hugs:

I think it would be interesting to be able to control what’s being appended to the conversation history (I’m talking about the assistant messages).
Right now both the assistant’s output audio and its transcript are being passed on as part of the conversation history automatically each turn. Given how accurate the output transcript usually is, I reckon it might be enough to pass this text and omit the audio altogether (which will incur less cost because text tokens are way cheaper than audio). Not sure how it will affect the model performance though.

This would be a really effective way to reduce cost in cases where the transcription quality is sufficient and the conversation history still needs to be retained.

I question that… if I say “read my emails to me” and the assistant starts, but I interrupt with “slow down please”, the information needed by the model isn’t in the text transcript.

There are lots of subtle examples of this….

On a separate note… nothing actually prevents you from deleting the last item of conversation history that’s text + audio and then adding it back in as text only. You can do that today.
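
Something like this, as a sketch. It assumes the conversation.item.delete and conversation.item.create client events, and that you grab the item id and transcript from the finished response’s events:

```ts
// Sketch: after a response finishes, replace the assistant's text+audio item
// with a text-only item carrying its transcript. The item id and transcript
// come from the finished response (e.g. response.output_item.done / response.done);
// double-check the exact field paths against the events you actually receive.
function replaceAudioItemWithText(
  ws: { send: (data: string) => void },
  itemId: string,
  transcript: string,
): void {
  // Remove the text+audio assistant item from the server-side history...
  ws.send(JSON.stringify({ type: "conversation.item.delete", item_id: itemId }));

  // ...and add it back as a text-only assistant message.
  ws.send(JSON.stringify({
    type: "conversation.item.create",
    item: {
      type: "message",
      role: "assistant",
      content: [{ type: "text", text: transcript }],
    },
  }));
}
```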


That cannot work. The AI would drop out of voice mode.

For all the months this was held back, producing voice from text input is NOT something the model is trained to do, and training it that way would make the model completely incompatible and unreusable as a general-purpose model on chat completions.

The server-side maintenance of an audio chat history as future input is essential.

The only method you are given is deleting messages by ID. Because deleting the oldest message breaks the prompt cache, you would not want to do that except when it brings your spend down significantly again.


Well, that was kinda the explanation I was looking for. Thanks for the info. One other thing: can you elaborate on why the model would drop out of voice mode?
I think I saw that behavior last time I tried to reconstruct the conversation history in a new session from the previous session’s messages. Is it because of the way the model is trained?

I tested it using the playground and the same behavior happens: the model feeds itself its own output and falls into an endless loop. Apparently the Web Audio API does not handle having the speaker and mic open at the same time very well, unlike the audio APIs on mobile phones.

I managed to solve the problem by switching between the model’s speech and the user’s speech, but I still need a way to interrupt the model’s speech during clearly uninteresting responses.
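
In case it helps anyone hitting the same loop, this is roughly what I mean by switching between the two. A browser sketch only: it mutes the mic track while assistant audio is playing and requests echo cancellation on the input.

```ts
// Browser sketch: half-duplex gating so the model never hears its own output.
// The mic track is muted while assistant audio is playing; echoCancellation
// helps but is often not enough with open speakers.
async function setupMic(): Promise<MediaStreamTrack> {
  const stream = await navigator.mediaDevices.getUserMedia({
    audio: { echoCancellation: true, noiseSuppression: true },
  });
  return stream.getAudioTracks()[0];
}

function gateMicDuringPlayback(micTrack: MediaStreamTrack, speaker: HTMLAudioElement): void {
  speaker.addEventListener("play", () => { micTrack.enabled = false; });  // model speaking
  speaker.addEventListener("ended", () => { micTrack.enabled = true; });  // back to the user
  speaker.addEventListener("pause", () => { micTrack.enabled = true; });
}
```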

I’m thinking about adding some kind of keyword detection running concurrently in the Web Audio API against a textual transcription; if I’m successful I’ll post it here.