The seq2seq Whisper models have clearly taken a back seat to LLM-based speech-to-text architectures at OpenAI, and in the meantime we’re facing an innovation gap for non-streaming/real-time ASR applications. Does the community have any insight into how OpenAI is looking to close the gap here? Are we expecting a re-trained/updated Whisper at some point, or will we need to start pivoting towards a multi-modal model approach?
Haven’t seen any updates on that - but maybe you could describe what you are missing, to give them some ideas?
One thing I’d love to do, not sure if it is possible, is to send GPT4o-Audio the transcription I got from Whisper, and give it the instruction “there are mistakes in this text, I will read the correct version to you, please correct it” ← simplified
Then send the same dictation that was sent to Whisper the first time.
Is this something currently achievable?
It sounds like you want to send a fixed-up transcript along with the same audio again, in the hope that this would improve the result. Whisper does not take any instructional input, though; it only receives audio.
The API’s prompt field for the model is used exclusively for the transcript of the audio that preceded the provided segment. That allows you to give Whisper the context of the prior transcript when you are splitting audio into smaller API calls. However, you can engineer that field a bit, especially to improve cold-start transcriptions, for example by introducing the speaker and the language they are speaking as if that were also part of the prior transcript.
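As a rough sketch (the file name and prompt text are placeholders, not a recommended recipe), that kind of priming looks like this with the Python SDK:

```python
# Hypothetical sketch: priming whisper-1 via the prompt field.
from openai import OpenAI

client = OpenAI()

with open("segment_02.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        # Written as if it were the tail of the preceding transcript:
        # a cold-start hint about the speaker and language, plus the
        # last sentence of the previous segment for continuity.
        prompt=(
            "Dr. Martin, dictating in English: ... and the patient was "
            "discharged with a follow-up appointment in two weeks."
        ),
    )

print(transcript.text)
```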
Thus, the only use of an outside language model is indeed to rewrite anything that was spoken but mis-transcribed, although Whisper already has a bit of its own language-modeling skill behind its transcriptions. An AI language model can still make the spoken word more presentable as writing: what is said can often be rephrased, and the information reads better when split into paragraphs and sections.
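If you go that route, a minimal sketch of the text-only post-processing step might look like this (the model choice and instructions are just one possible setup, not a prescription):

```python
# Hypothetical sketch: cleaning up a Whisper transcript with a text-only model.
from openai import OpenAI

client = OpenAI()

raw_transcript = "so um the patient presented with uh chest pain for two days ..."

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "system",
            "content": (
                "Rewrite the following dictation as clean prose. Fix obvious "
                "mis-transcriptions, remove filler words, and split the text "
                "into paragraphs. Do not add information that was not spoken."
            ),
        },
        {"role": "user", "content": raw_transcript},
    ],
)

print(response.choices[0].message.content)
```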
Not to Whisper, but to GPT4o-Audio: https://platform.openai.com/docs/models/gpt-4o-audio-preview
I would send the output of Whisper (the automatic transcription) to GPT4o-Audio, and then “read” it the dictated audio again.
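Roughly, what I have in mind is something like this (an untested sketch: the file name and prompt wording are placeholders, and I’m assuming the chat completions audio-input message format):

```python
# Untested sketch: send the Whisper transcript plus the original audio
# to gpt-4o-audio-preview and ask it to correct the text.
import base64
from openai import OpenAI

client = OpenAI()

with open("dictation.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

whisper_text = "transcript produced by whisper-1 goes here"

completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text"],  # we only want corrected text back, no audio output
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": (
                        "There are mistakes in this transcript. Listen to the "
                        "attached audio and return a corrected version:\n\n"
                        + whisper_text
                    ),
                },
                {
                    "type": "input_audio",
                    "input_audio": {"data": audio_b64, "format": "wav"},
                },
            ],
        }
    ],
)

print(completion.choices[0].message.content)
```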
Sending your audio again to gpt-4o-audio-preview would amplify the costs significantly - by a factor of 1x-5x if you go by OpenAI’s estimates. GPT-4o is also not a good transcriber or audio-understanding AI: many here report wrong information being recited from spoken audio, such as phone numbers, so it somewhat defeats the purpose of obtaining a high-quality transcription - it is built for chatting.
I would not expect gpt-4o-audio to be trained well on doing something with both audio and text of the same source.
| Model | Input (per 1M audio tokens) | Output (per 1M audio tokens) |
|---|---|---|
| gpt-4o-audio-preview-2024-12-17 | $40.00 | $80.00 |
| gpt-4o-audio-preview-2024-10-01 | $100.00 | $200.00 |
| gpt-4o-mini-audio-preview-2024-12-17 | $10.00 | $20.00 |
Well, that is just an expectation. It could turn out amazing!
We are a medical transcription business, so having the best-quality output is a key value for us. I’ll try to convince my developers to try it out.
If I succeed, I’ll post the conclusion about it here.
This is something that I’m working on now. There are a few papers indicating that passing an n-best list and/or confidence scores can help a downstream LLM correct transcription and diarization errors. Providing targeted (low-confidence) segments helps localize the corrections and limits creativity/drift from the original source as well.
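The OpenAI endpoint doesn’t expose an n-best list, but segment-level confidence is available via verbose_json, so a rough sketch of the confidence-targeted part could look like this (the threshold and prompt wording are assumptions, not a reference implementation):

```python
# Sketch: flag low-confidence Whisper segments and ask a text model
# to correct only those spans.
from openai import OpenAI

client = OpenAI()

with open("dictation.wav", "rb") as f:
    result = client.audio.transcriptions.create(
        model="whisper-1",
        file=f,
        response_format="verbose_json",  # includes per-segment avg_logprob
    )

LOW_CONFIDENCE = -0.7  # assumed avg_logprob threshold; tune on your own data

flagged = [
    f"[{seg.start:.1f}s-{seg.end:.1f}s] {seg.text.strip()}"
    for seg in result.segments
    if seg.avg_logprob < LOW_CONFIDENCE
]

prompt = (
    "Here is an automatic transcript:\n\n"
    + result.text
    + "\n\nThese segments had low confidence and may contain errors:\n"
    + "\n".join(flagged)
    + "\n\nCorrect only those segments where the context makes the intended "
    "wording clear, and leave everything else exactly as transcribed."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
)

print(response.choices[0].message.content)
```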
Of course, brilliant! This makes a lot of sense.
By any chance, would you have the links to those papers?