How do you handle user transcripts in real-time GPT-4o chats?

Hey folks,
I’m building a real-time voice chat using GPT-4o with audio responses. Everything works great on the AI side — I get real-time responses back just fine.

But I’m struggling to capture the user’s spoken transcript. I’ve been using webkitSpeechRecognition in Chrome to get the user input before sending it to OpenAI, but:

  • It stops randomly (especially on silence)
  • It only works in Chrome
  • And I don’t see the user input echoed back from OpenAI

Is there any way to get the user transcript directly from the API or something more reliable/cross-browser for speech-to-text?

Would love to hear how others are handling this! 🙏


OpenAI’s Realtime API can optionally provide the user-side transcript — can you use that? The Realtime API is a voice-to-voice model, but it can also return a transcript of the user’s audio by running it through a separate transcription model. You configure this in a `session.update` event by setting `input_audio_transcription` and choosing the transcription model. Then, at conversation time, subscribe to the `conversation.item.input_audio_transcription.completed` server event to receive the user’s transcript. (Note that `response.audio_transcript.done` carries the transcript of the model’s audio response, not the user’s input.)

See details here https://platform.openai.com/docs/api-reference/realtime-server-events/response/audio_transcript/done
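A minimal sketch of the two pieces described above, assuming you already have a WebSocket connection (`ws`) to the Realtime endpoint — one helper builds the `session.update` payload that enables user-input transcription (here assuming `whisper-1` as the transcription model), and one pulls the user transcript out of incoming server events:

```javascript
// Build the session.update payload that turns on user-side transcription.
// "whisper-1" is assumed here as the transcription model.
function buildSessionUpdate() {
  return {
    type: "session.update",
    session: {
      input_audio_transcription: { model: "whisper-1" },
    },
  };
}

// Extract the user transcript from a parsed server event, or return null
// for any other event type.
function userTranscriptFrom(event) {
  if (event.type === "conversation.item.input_audio_transcription.completed") {
    return event.transcript;
  }
  return null;
}
```

In use, you would send `JSON.stringify(buildSessionUpdate())` over the socket right after the session opens, and call `userTranscriptFrom(JSON.parse(msg.data))` in your `onmessage` handler, treating a non-null result as the user’s spoken text.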