Web Speech API with Whisper

I developed a system for correcting text using the Web Speech API.

It works like this:

In a browser, you click a button and start recording your voice using the Web Speech API. With each final capture, the text is sent over a WebSocket to OpenAI (GPT-4.1) for correction. It's simple, but it works for me. Roughly, the browser side looks like this (the WebSocket endpoint and button id are placeholders for whatever your app uses):
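```ts
// Minimal browser sketch: capture speech with the Web Speech API and
// forward each final result over a WebSocket for correction.
// "ws://localhost:8080/correct" and "#record" are placeholders.
const SpeechRecognitionCtor =
  (window as any).SpeechRecognition ?? (window as any).webkitSpeechRecognition;

const recognition = new SpeechRecognitionCtor();
recognition.continuous = true;      // keep listening across pauses
recognition.interimResults = true;  // emit partial hypotheses too

const socket = new WebSocket("ws://localhost:8080/correct");

recognition.onresult = (event: any) => {
  for (let i = event.resultIndex; i < event.results.length; i++) {
    const result = event.results[i];
    if (result.isFinal) {
      // Only final captures are sent to the backend for GPT correction.
      socket.send(JSON.stringify({ text: result[0].transcript }));
    }
  }
};

document.querySelector("#record")!.addEventListener("click", () => {
  recognition.start();
});
```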

I want real-time correction. I think I can do it with Whisper.

Do you have any real-time options, like Whisper, to simplify this?
How do you implement this?

There's a bit of confusion in your explanation. (A note on terminology: the Web Speech API is the browser's built-in speech recognition, which is separate from OpenAI's models.) Let's see if I can give you ideas.

Whisper takes audio input and outputs a transcript: an unformatted stream of text without paragraph breaks. It does one thing only, converting audio to text.

So, Whisper is an AI model, just as gpt-4o-transcribe is an AI model, each dedicated to listening and writing down what was spoken.
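A one-shot transcription call with the OpenAI Node SDK looks roughly like this (the file name is a placeholder, and you can swap `gpt-4o-transcribe` in for `whisper-1`):

```ts
// Sketch of a basic transcription call: audio file in, plain
// unformatted text out.
import fs from "node:fs";
import OpenAI from "openai";

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

const transcription = await client.audio.transcriptions.create({
  file: fs.createReadStream("recording.m4a"), // placeholder file name
  model: "whisper-1", // or "gpt-4o-transcribe"
});

console.log(transcription.text); // one unformatted block of text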

A follow-up AI call that turns a transcript of jumbled spoken thoughts and filler words into something suitable for presentation is certainly a good add-on.

However, that task needs more context than the voice-silence-detected chunks you'd get from "realtime" (which is still turn-based). A "make this transcription pretty" task works best with the full contents of a transcript, so the model can place logical paragraph boundaries with a complete understanding of what has been said and what is yet to come.
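A sketch of that follow-up call, run once over the complete transcript (the model choice and prompt wording here are just illustrative):

```ts
// Sketch of the "make this transcription pretty" pass, applied to
// the full transcript rather than per-chunk.
import OpenAI from "openai";

const client = new OpenAI();

async function formatTranscript(rawTranscript: string): Promise<string> {
  const response = await client.chat.completions.create({
    model: "gpt-4.1",
    messages: [
      {
        role: "system",
        content:
          "Clean up this spoken transcript: remove filler words, fix " +
          "punctuation, and break it into logical paragraphs. Do not " +
          "change the meaning or add content.",
      },
      { role: "user", content: rawTranscript },
    ],
  });
  return response.choices[0].message.content ?? "";
}
```

Running this once at the end of a session, rather than on each chunk, gives the model the context it needs to place paragraph boundaries sensibly.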

So, improving smaller language snippets won't work as well. You can still get those smaller snippets by using streaming speech-to-text, routed through a backend proxy rather than a client that calls the API directly (you don't want your API key in the browser).

https://platform.openai.com/docs/guides/speech-to-text#streaming
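A rough server-side sketch of that streaming approach, based on the guide above (note that `stream: true` works with the gpt-4o-transcribe models, not `whisper-1`; the event names follow the current docs, and the file name is a placeholder):

```ts
// Sketch of streaming transcription on the backend. Deltas arrive
// incrementally; forward each one to the browser over your WebSocket.
import fs from "node:fs";
import OpenAI from "openai";

const client = new OpenAI();

const stream = await client.audio.transcriptions.create({
  file: fs.createReadStream("recording.m4a"), // placeholder file name
  model: "gpt-4o-mini-transcribe",
  response_format: "text",
  stream: true,
});

for await (const event of stream) {
  if (event.type === "transcript.text.delta") {
    process.stdout.write(event.delta); // partial text as it's recognized
  }
}
```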