WebRTC full manual control use case

I am trying to use the new WebRTC APIs. I do not want the server to respond on its own; instead, I want to get a transcript of what the user said, process it myself, and then tell the server what to say back to the user. Is this currently possible? I have tried:

          turn_detection: {
            type: 'server_vad',
            threshold: 0.5,
            prefix_padding_ms: 300,
            silence_duration_ms: 500,
            create_response: false,
          }

But it has no effect that I can see, and when I receive the ‘session.created’ data payload, it shows create_response set to true.
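For reference, here is a minimal sketch of one way the setting above could be sent once the connection is up. It assumes the server accepts a `session.update` event, serialized as JSON and sent over the WebRTC data channel, and that `turn_detection` sits under a `session` key in that event. `buildSessionUpdate`, `sendSessionUpdate`, and the `SendChannel` type are hypothetical names used only for illustration, not part of any SDK.

```typescript
// Shape of the turn detection config from the snippet above.
interface TurnDetection {
  type: "server_vad";
  threshold: number;
  prefix_padding_ms: number;
  silence_duration_ms: number;
  create_response: boolean;
}

// Assumed event envelope: a `session.update` message carrying the config.
interface SessionUpdateEvent {
  type: "session.update";
  session: { turn_detection: TurnDetection };
}

// Minimal structural type so the sketch also runs outside a browser.
// A real RTCDataChannel satisfies it.
interface SendChannel {
  send(data: string): void;
}

// Build the update event with the same values as the snippet above.
function buildSessionUpdate(createResponse: boolean): SessionUpdateEvent {
  return {
    type: "session.update",
    session: {
      turn_detection: {
        type: "server_vad",
        threshold: 0.5,
        prefix_padding_ms: 300,
        silence_duration_ms: 500,
        create_response: createResponse,
      },
    },
  };
}

// Serialize the event, send it over the channel, and return the JSON
// string so the caller can log exactly what was sent.
function sendSessionUpdate(channel: SendChannel, createResponse: boolean): string {
  const payload = JSON.stringify(buildSessionUpdate(createResponse));
  channel.send(payload);
  return payload;
}
```

In a browser, `channel` would be the data channel created on the `RTCPeerConnection`, and the update would be sent after the channel's `open` event fires, for example `sendSessionUpdate(dataChannel, false)`.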

If this use case is not possible, how would you handle this flow?