Can I replace OpenAI's Whisper transcription in real-time WebRTC chat with a custom transcription function?

I'm building a real-time voice chat on the Realtime API over WebRTC. Currently, input audio is transcribed with the whisper-1 model, but the transcription quality isn't ideal for my use case.

Is it possible to replace or bypass the Whisper transcription and provide my own speech-to-text function instead (e.g., from a Lambda endpoint or a custom backend)? As far as I can tell, transcription happens on OpenAI's end and WebRTC streams the audio directly to OpenAI (correct me if I'm wrong), so I'm wondering whether there's a way to intercept the audio and substitute my own transcription.
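If there's no way to intercept on the server side, my rough plan is to fork the audio locally: keep sending the mic track to OpenAI over WebRTC as before, and also record a cloned copy and stream chunks to my own endpoint. Here's a minimal sketch of that idea (pc is my existing RTCPeerConnection, and https://example.com/transcribe is just a placeholder for my Lambda endpoint):

// Inside my async setup function.
const micStream = await navigator.mediaDevices.getUserMedia({ audio: true });

// Unchanged: this live track still goes to OpenAI over WebRTC.
pc.addTrack(micStream.getAudioTracks()[0], micStream);

// Fork: record a cloned copy of the stream for my own STT.
const recorder = new MediaRecorder(micStream.clone(), {
  mimeType: "audio/webm;codecs=opus",
});

recorder.ondataavailable = async (e) => {
  if (e.data.size === 0) return;
  // Caveat: with a timeslice, only the first chunk is a standalone WebM
  // file, so my backend would need to reassemble the stream (or I'd
  // restart the recorder per utterance).
  const res = await fetch("https://example.com/transcribe", {
    method: "POST",
    headers: { "Content-Type": "audio/webm" },
    body: e.data,
  });
  const { text } = await res.json(); // response shape is my own convention
  console.log("custom transcript:", text);
};

recorder.start(1000); // emit a chunk roughly every second

Does this double-capture approach make sense, or is there a supported way to do it through the API itself?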

For reference, here's the session.update snippet I'm currently sending to enable transcription:
const event = {
  type: "session.update",
  session: {
    instructions: instructions,
    input_audio_transcription: { model: "whisper-1" },
  },
};
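Separately, if I do end up running my own pipeline, I assume the fallback is to just turn the built-in transcription off. I believe passing null disables it (input transcription seems to be off unless you opt in), but correct me if that's wrong:

const event = {
  type: "session.update",
  session: {
    instructions: instructions,
    // Assumption on my part: null disables server-side input transcription.
    input_audio_transcription: null,
  },
};
dc.send(JSON.stringify(event)); // dc: my existing WebRTC data channel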