The seq2seq Whisper models have clearly taken a back seat to LLM-based speech-to-text architectures at OpenAI, and in the meantime we’re facing an innovation gap for non-streaming/real-time ASR applications. Does the community have any insight into how OpenAI is looking to close the gap here? Are we expecting a re-trained/updated Whisper at some point, or will we need to start pivoting towards a multi-modal model approach?
Haven’t seen any updates on that - but maybe you could describe what you are missing, to give them some ideas?
One thing I’d love to do, not sure if it is possible, is to send GPT4o-Audio the transcription I got from Whisper, and give it the instruction “there are mistakes in this text, I will read the correct version to you, please correct it” ← simplified
Then send the same dictation that was sent to Whisper the first time.
Is this something currently achievable?
It sounds like you want to send a fixed-up transcript along with the same audio again, in the hope that this would improve the result. Whisper does not take any instructional input, though; it only receives audio.
The API’s prompt field for the model is used exclusively for the transcript of the audio that preceded the provided segment. That allows you to give Whisper the context of the prior transcript when you are splitting audio into smaller API calls. However, you can engineer that field a bit, especially to improve cold-start transcriptions, for example by introducing the speaker and the language they are speaking as if that were also part of the prior transcript.
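As a rough sketch (the file name and prompt text are placeholders, not a recommended recipe), that kind of priming looks like this with the Python SDK:

```python
# Hypothetical sketch: priming whisper-1 via the prompt field.
from openai import OpenAI

client = OpenAI()

with open("segment_02.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        # Written as if it were the tail of the preceding transcript:
        # a cold-start hint about the speaker and language, plus the
        # last sentence of the previous segment for continuity.
        prompt=(
            "Dr. Martin, dictating in English: ... and the patient was "
            "discharged with a follow-up appointment in two weeks."
        ),
    )

print(transcript.text)
```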
Thus, the only use of an outside language model is indeed to rewrite anything that was spoken but mis-transcribed, although Whisper already has a bit of its own language-modeling skill behind its transcriptions. An AI language model can still make the spoken word more presentable as writing: what is said can often be rephrased, and the information reads better when split into paragraphs and sections.
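If you go that route, a minimal sketch of the text-only post-processing step might look like this (the model choice and instructions are just one possible setup, not a prescription):

```python
# Hypothetical sketch: cleaning up a Whisper transcript with a text-only model.
from openai import OpenAI

client = OpenAI()

raw_transcript = "so um the patient presented with uh chest pain for two days ..."

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {
            "role": "system",
            "content": (
                "Rewrite the following dictation as clean prose. Fix obvious "
                "mis-transcriptions, remove filler words, and split the text "
                "into paragraphs. Do not add information that was not spoken."
            ),
        },
        {"role": "user", "content": raw_transcript},
    ],
)

print(response.choices[0].message.content)
```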
Not to Whisper, but to GPT4o-Audio: https://platform.openai.com/docs/models/gpt-4o-audio-preview
I would send the output of Whisper (the automatic transcription) to GPT4o-Audio, and then “read” it the dictated audio again.
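Roughly, what I have in mind is something like this (an untested sketch: the file name and prompt wording are placeholders, and I’m assuming the chat completions audio-input message format):

```python
# Untested sketch: send the Whisper transcript plus the original audio
# to gpt-4o-audio-preview and ask it to correct the text.
import base64
from openai import OpenAI

client = OpenAI()

with open("dictation.wav", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode("utf-8")

whisper_text = "transcript produced by whisper-1 goes here"

completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",
    modalities=["text"],  # we only want corrected text back, no audio output
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": (
                        "There are mistakes in this transcript. Listen to the "
                        "attached audio and return a corrected version:\n\n"
                        + whisper_text
                    ),
                },
                {
                    "type": "input_audio",
                    "input_audio": {"data": audio_b64, "format": "wav"},
                },
            ],
        }
    ],
)

print(completion.choices[0].message.content)
```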
Sending your audio again to gpt-4o-audio-preview would amplify the costs significantly - by a factor of 1x-5x if you go by OpenAI’s estimates. GPT-4o is also not a good transcriber or audio-understanding AI: many here report wrong information being recited from spoken audio, such as phone numbers, so it somewhat defeats the purpose of obtaining a high-quality transcription - it is built for chatting.
I would not expect gpt-4o-audio to be trained well on doing something with both audio and text of the same source.
| Model | Input (per 1M audio tokens) | Output (per 1M audio tokens) |
|---|---|---|
| gpt-4o-audio-preview-2024-12-17 | $40.00 | $80.00 |
| gpt-4o-audio-preview-2024-10-01 | $100.00 | $200.00 |
| gpt-4o-mini-audio-preview-2024-12-17 | $10.00 | $20.00 |
Well, that is just an expectation. It could turn out amazing!
We are a medical transcription business, so having the best-quality output is a key value for us. I’ll try to convince my developers to try it out.
If I succeed, I’ll post the conclusion about it here.
This is something that I’m working on now. There are a few papers indicating that passing an n-best list and/or confidence scores can help a downstream LLM correct transcription and diarization errors. Providing targeted (low-confidence) segments helps localize the corrections and limits creativity/drift from the original source as well.
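The OpenAI endpoint doesn’t expose an n-best list, but segment-level confidence is available via verbose_json, so a rough sketch of the confidence-targeted part could look like this (the threshold and prompt wording are assumptions, not a reference implementation):

```python
# Sketch: flag low-confidence Whisper segments and ask a text model
# to correct only those spans.
from openai import OpenAI

client = OpenAI()

with open("dictation.wav", "rb") as f:
    result = client.audio.transcriptions.create(
        model="whisper-1",
        file=f,
        response_format="verbose_json",  # includes per-segment avg_logprob
    )

LOW_CONFIDENCE = -0.7  # assumed avg_logprob threshold; tune on your own data

flagged = [
    f"[{seg.start:.1f}s-{seg.end:.1f}s] {seg.text.strip()}"
    for seg in result.segments
    if seg.avg_logprob < LOW_CONFIDENCE
]

prompt = (
    "Here is an automatic transcript:\n\n"
    + result.text
    + "\n\nThese segments had low confidence and may contain errors:\n"
    + "\n".join(flagged)
    + "\n\nCorrect only those segments where the context makes the intended "
    "wording clear, and leave everything else exactly as transcribed."
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
)

print(response.choices[0].message.content)
```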
Of course, brilliant! This makes a lot of sense.
By any chance, would you have the links to those papers?