Creepy bug in Realtime API + Function Calling: Extra Audio Not in Transcription

This is actually two bugs combined:

  1. Arguments for Function Calling appear inside the text transcription.
  2. When this happens, the generated audio contains unrelated content instead of voicing these arguments. Often, this content has no connection to the conversation topic and can even be in a different language.

In my example, the audio contains more than twice as much speech as the transcription, becomes chaotic toward the end, and includes repetitions.

Video (with generated audio) of how this session went; the untranscribed audio starts at 0:27: openai_realtime_session_4c740659-88bb-4c56-9765-a80240491b76_coral.mp4 - Google Drive

Full websocket session data exchange: openai_realtime_session_4c740659-88bb-4c56-9765-a80240491b76_coral.txt - Google Drive

Screenshot:


Another example.
In this case, the audio unexpectedly switches from Russian to Japanese at 0:21, right after voicing the transcribed text: openai_realtime_session_ee51d0bc-a878-41a5-bd28-5d408f60dc1d_coral.mp4 - Google Drive

Full websocket session data exchange: openai_realtime_session_ee51d0bc-a878-41a5-bd28-5d408f60dc1d_coral.txt - Google Drive

Let me know if you need any additional logs or details to help debug this issue.

2 Likes

I’ve also seen this! However, not to the scale of 30 seconds.

1 Like

Maybe that’s why; we’ve always seen weird glitches when they’re updating something.

@hugebelts, I’m not sure; I’ve seen the bug reproducing for a few days already. I also see very similar issues in the "Related" section.

Looks like the bug has been present since October.

2 Likes

Which of the OpenAI realtime API voices is this? Coral?

I’ve never had any extraneous spoken dialog in my interactions, so I’m interested to find out what might be the cause.

Sending blank audio at odd times can cause issues, and very low-quality audio as input can also confuse the model. Do you happen to keep a copy of the input audio to check against?
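
Something like this is usually enough to keep a debug copy of everything you send, so you can replay it and listen for blank or garbled segments (a rough Python sketch; it assumes the default pcm16 input format at 24 kHz mono, and the class and file names are just placeholders):

```python
# Rough sketch: tee every audio chunk sent to the Realtime API into a local
# WAV file for later inspection. Assumes the session uses the default pcm16
# input format (16-bit mono PCM at 24 kHz); adjust if yours differs.
import base64
import json
import wave

class InputAudioRecorder:
    def __init__(self, path="input_debug.wav", sample_rate=24000):
        self.wav = wave.open(path, "wb")
        self.wav.setnchannels(1)   # mono
        self.wav.setsampwidth(2)   # 16-bit samples
        self.wav.setframerate(sample_rate)

    def on_outgoing_event(self, event_json: str):
        """Call this for every event you send over the websocket."""
        event = json.loads(event_json)
        if event.get("type") == "input_audio_buffer.append":
            self.wav.writeframes(base64.b64decode(event["audio"]))

    def close(self):
        self.wav.close()
```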

1 Like

@Foxalabs, I don’t think it depends on the voice. I reproduced the bug with coral, shimmer, and sage. I haven’t tested with other voices.

1 Like

Gotcha, I’ve never managed to get that effect, so I wonder what you’re doing differently. Do you have your end-to-end speech code to look at?

Personally, I just handle the socket comms to the OpenAI endpoint and do it manually; not sure if this is with WebRTC or not?

@Foxalabs, I’m also working with the Realtime API manually via websocket. See the full logs of the websocket sessions, which I posted in the original message above.

I’m asking the Realtime model to do both: respond to the user and also call the function set_emotion. Sometimes it works as expected, but sometimes this crazy bug appears. I suspect that the function call incorrectly appearing in the response text is what causes the audio to go crazy.
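
For context, here is roughly how I set up the session (a simplified Python sketch, not my exact code; the model name, instructions, and the set_emotion schema are placeholders for illustration):

```python
# Rough sketch of the session setup over a raw websocket connection.
# Model name, instructions, and the set_emotion schema are illustrative only.
import asyncio, json, os
import websockets  # pip install websockets

async def main():
    url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
    headers = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",
    }
    # Older versions of the websockets package use extra_headers= instead.
    async with websockets.connect(url, additional_headers=headers) as ws:
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {
                "modalities": ["audio", "text"],
                "voice": "coral",
                "instructions": "Answer the user out loud and always call "
                                "set_emotion with the emotion of your reply.",
                "tools": [{
                    "type": "function",
                    "name": "set_emotion",
                    "description": "Report the emotion of the assistant's reply.",
                    "parameters": {
                        "type": "object",
                        "properties": {"emotion": {"type": "string"}},
                        "required": ["emotion"],
                    },
                }],
                "tool_choice": "auto",
            },
        }))
        # ...then stream input_audio_buffer.append events and read the
        # response audio / function call argument events from the socket...

asyncio.run(main())
```

Most of the time the function call comes back through the function-call events as expected; in the buggy sessions, the arguments also show up inside the text transcription, as described above.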

2 Likes

I’m still seeing this in some cases, and I don’t think it’s a function call issue, since I don’t use function calling. The AI sometimes also starts spewing text from previous responses that isn’t included in the transcription, then transitions to what it’s saying now.

1 Like

I’ll make sure I raise it at our next meeting with OAI. The Realtime API is still in beta and does have a number of issues, most of which are already logged and being worked on. Big fan of the low-latency speech APIs myself; the potential is huge.

2 Likes

Hi @Foxalabs!
Did you have a chance to tell OpenAI about the bug at the meeting?

Experiencing the same thing. Really odd and a bit creepy

I realised that sometimes it happens without function calling too.

Yeah, this happens to me too. Not often, but often enough. The transcription looks fine, but the audio contains very strange (and yes, creepy) sentences.