WebRTC full manual control use case

I am trying to use the new WebRTC APIs. I don't want the server to respond on its own; instead, I want to get a transcript of what the user said, process it myself, and then tell the server what to say back to the user. Is this currently possible? I have tried

          turn_detection: {
            type: 'server_vad',
            threshold: 0.5,
            prefix_padding_ms: 300,
            silence_duration_ms: 500,
            create_response: false,
          }

But it has no effect that I can see, and when I get the 'session.created' data payload, create_response is set back to true.
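For reference, here is roughly the full flow I am trying to drive over the data channel. The event names are taken from the Realtime API reference, and decideWhatToSay is just a placeholder for my own processing step:

    // dc is the RTCDataChannel ("oai-events") negotiated with the Realtime API
    dc.onopen = () => {
      // Ask for VAD turn detection and transcription, but no automatic responses
      dc.send(JSON.stringify({
        type: 'session.update',
        session: {
          turn_detection: {
            type: 'server_vad',
            create_response: false, // plus the VAD tuning params above
          },
          input_audio_transcription: { model: 'whisper-1' },
        },
      }));
    };

    dc.onmessage = async (e) => {
      const event = JSON.parse(e.data);
      // Fires once the user's completed turn has been transcribed
      if (event.type === 'conversation.item.input_audio_transcription.completed') {
        const reply = await decideWhatToSay(event.transcript); // my own logic
        // Tell the model exactly what to say back to the user
        dc.send(JSON.stringify({
          type: 'response.create',
          response: { instructions: `Say exactly: "${reply}"` },
        }));
      }
    };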

If this use case is not possible, how would you handle this flow?

Use Deepgram for faster and more accurate transcriptions, and get rid of the threshold, silence duration, and prefix padding, since those settings exist to drive response activation from the input audio.
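Something like this rough sketch is what I mean, talking to Deepgram's live-transcription endpoint directly. The params and response shape are per their streaming docs; DEEPGRAM_API_KEY and handleTranscript are placeholders, and encoding/sample_rate need to match your audio:

    // Open a live-transcription socket; the ['token', key] subprotocol is
    // Deepgram's browser-friendly auth (no custom headers in a browser WebSocket)
    const dg = new WebSocket(
      'wss://api.deepgram.com/v1/listen?encoding=linear16&sample_rate=16000',
      ['token', DEEPGRAM_API_KEY],
    );

    dg.onmessage = (e) => {
      const msg = JSON.parse(e.data);
      const text = msg.channel?.alternatives?.[0]?.transcript;
      if (msg.is_final && text) {
        handleTranscript(text); // your own processing, then tell the model what to say
      }
    };

    // Elsewhere: send each raw PCM chunk from your mic capture
    // dg.send(pcmChunk);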


Hey! We are running into the exact same issue. Did you ever find a workaround or a fix?

Wish I had better news, but no. I have not gone back to it yet, but my intention is to flirt with Deepgram until these things get ironed out. Good luck!