I’ve observed two issues with realtime turn-taking that I’d like to report.
- The `interrupt_response` configuration doesn’t seem to have any effect. I’ve tried setting it to False, but the LLM still stops talking if I start talking over it. Am I misunderstanding the purpose of this configuration? I expected the LLM to finish speaking its whole response with this config set to False. I’ve observed this using both server and semantic VAD.
- Semantic VAD often doesn’t pick up on short utterances by the caller, like “yeah” or “sure”. I noticed this problem because I have a tool call that sends an SMS, and I want the LLM to make sure the caller explicitly opts into it. So the LLM says something like, “Do you want me to send that text to your phone?” If the caller just says “yeah”, no input is registered. There’s no `input_audio_buffer.speech_started` event and the caller’s turn isn’t registered, so there’s silence on the call until the user says something more obvious like, “Yes, send the text.” Server VAD does a good job of picking up on these short utterances, so I’ve switched back to using it for now. When I was using semantic VAD I had `eagerness` set to `high`.
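
For anyone trying to reproduce this, here is a minimal sketch of the two turn-detection configurations described above, expressed as `session.update` payloads. It is an illustration and not my exact production code: the helper name `build_turn_detection_update` is hypothetical, and it assumes the session shape where `turn_detection` sits directly under `session` (newer API versions may nest it differently, so check the reference for the version you're on).

```python
import json
from typing import Optional


def build_turn_detection_update(
    vad_type: str,
    interrupt_response: bool = False,
    eagerness: Optional[str] = None,
) -> str:
    """Return a session.update event as JSON text (hypothetical helper).

    Assumption: turn_detection lives directly under "session".
    """
    turn_detection = {
        "type": vad_type,  # "server_vad" or "semantic_vad"
        "create_response": True,
        # Issue 1: setting this to False did not stop the model
        # from being cut off when the caller talks over it.
        "interrupt_response": interrupt_response,
    }
    # eagerness only applies to semantic VAD
    if vad_type == "semantic_vad" and eagerness is not None:
        turn_detection["eagerness"] = eagerness
    event = {"type": "session.update", "session": {"turn_detection": turn_detection}}
    return json.dumps(event)


def is_speech_started(server_event: dict) -> bool:
    """True if a decoded server event marks the start of caller speech."""
    return server_event.get("type") == "input_audio_buffer.speech_started"


# Configuration that missed short utterances like "yeah" (issue 2)
semantic_payload = build_turn_detection_update("semantic_vad", eagerness="high")
# Configuration I've switched back to for now
server_payload = build_turn_detection_update("server_vad")
```

Sending either payload over the websocket and logging every event for which `is_speech_started` returns true should make it easy to see whether a short “yeah” is detected under each mode.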