Function-call arguments sometimes appear inside the text transcription.
When this happens, the generated audio contains unrelated content instead of voicing these arguments. Often, this content has no connection to the conversation topic and can even be in a different language.
In my example, the audio contains more than twice as much speech as the transcription, becomes chaotic toward the end, and includes repetitions.
Which of the OpenAI realtime API voices is this? Coral?
I’ve never had any extraneous spoken dialog in my interactions, so I’m interested to find out what might be the cause.
Sending blank audio at odd times can cause issues, and very low-quality input audio can also confuse the model. Do you happen to keep a copy of the input audio to check against?
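If you do keep the input audio, a quick way to check for blank chunks is to measure their loudness. Here is a minimal sketch, assuming the input is 16-bit little-endian mono PCM (the `pcm16` input format). The helper names and the 0.01 threshold are my own choices for illustration, not anything defined by the API, so tune the threshold against your own recordings.

```python
import math
import struct


def rms_pcm16(chunk: bytes) -> float:
    """Root-mean-square amplitude of 16-bit little-endian mono PCM, scaled to 0..1."""
    n = len(chunk) // 2  # number of complete 16-bit samples
    if n == 0:
        return 0.0
    # Ignore a trailing odd byte, if any, so unpack gets whole samples only.
    samples = struct.unpack("<%dh" % n, chunk[: n * 2])
    return math.sqrt(sum(s * s for s in samples) / n) / 32768.0


def is_near_silent(chunk: bytes, threshold: float = 0.01) -> bool:
    """True when the chunk is quieter than the (assumed, arbitrary) threshold."""
    return rms_pcm16(chunk) < threshold
```

Running `is_near_silent` over each chunk before it is sent, and logging the ones that return True, would show whether the odd responses line up with blank input.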
@Foxalabs, I’m also working with the Realtime API manually via WebSocket. See the full log of WebSocket sessions that I posted in the original message above.
I’m asking the Realtime model to do both: respond to the user and also call the function set_emotion. Sometimes it works as expected, but sometimes this crazy bug appears. I suspect that the function call incorrectly appearing in the response text is what causes the audio to go crazy.
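To catch the bad responses automatically, one option is to scan each finished transcript for signs of a leaked function call. This is only a rough sketch: `find_leaked_arguments` is a hypothetical helper, and it assumes the leak shows up either as the function name itself or as a flat JSON object in the transcript text. It does not touch the API at all; it just inspects a string you already have.

```python
import json
import re


def find_leaked_arguments(transcript: str, function_names: list[str]) -> list[str]:
    """Return function names and JSON-object fragments found in a transcript."""
    # A function name should never be spoken, so its presence is suspicious.
    hits = [name for name in function_names if name in transcript]
    # Look for flat {...} spans and keep only those that parse as JSON.
    for match in re.finditer(r"\{[^{}]*\}", transcript):
        try:
            json.loads(match.group(0))
        except json.JSONDecodeError:
            continue  # ordinary braces in text, not a JSON object
        hits.append(match.group(0))
    return hits
```

For example, a transcript such as `I am happy. {"emotion": "joy"}` (the argument shape here is made up) would be flagged, so that response could be logged or dropped before the audio is played.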
I’m still seeing this in some cases. I don’t think function calls are the cause, since I don’t use function calling. The AI also sometimes starts spewing text from previous responses that isn’t included in the transcription, then transitions to what it’s saying now.
I’ll make sure I raise it at our next meeting with OAI. The Realtime API is still in beta and does have a number of issues, most of which are already logged and being worked on. I’m a big fan of the low-latency speech APIs myself; the potential is huge.
Yeah, this happens to me too. Not often, but often enough. The transcription looks fine, but the audio contains very strange (and yes, creepy) sentences.