Impact of WAV vs M4A on Whisper Transcription Quality

yancheng.cheok · September 6, 2024, 4:41am

Hi,

I was wondering if there’s a difference in the quality of transcription between using WAV or M4A for audio recordings.

Will Whisper produce better text output when using a WAV file?

Thank you.

jochenschultz · September 6, 2024, 5:28am

WAV files are typically uncompressed, but the difference in transcription quality should be minor, and the speed gain from uploading the smaller M4A (MP4 audio) file should easily make up for it.

It really comes down to the quality of the recording.

Most of the time the model doe n rely ned to knw th complet transcr. and it will still understand.
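To put the size difference in rough numbers, here is a back-of-the-envelope sketch. The sample rate, bit depth, and AAC bitrate are illustrative assumptions, not values from this thread, and the 25 MB figure is the upload limit as I understand it from the Whisper API docs, so check it before relying on it.

```python
def wav_bytes_per_second(sample_rate_hz: int, bits_per_sample: int, channels: int) -> int:
    """Data rate of uncompressed PCM audio (ignores the small WAV header)."""
    return sample_rate_hz * (bits_per_sample // 8) * channels


def compressed_bytes_per_second(bitrate_kbps: int) -> int:
    """Average data rate of a constant-bitrate compressed stream, e.g. AAC in M4A."""
    return bitrate_kbps * 1000 // 8


def max_minutes(limit_bytes: int, bytes_per_second: int) -> float:
    """How many minutes of audio fit under an upload size limit."""
    return limit_bytes / bytes_per_second / 60


# Assumed example values: 16 kHz, 16-bit, mono WAV vs. 64 kbps AAC.
WAV_RATE = wav_bytes_per_second(16_000, 16, 1)
M4A_RATE = compressed_bytes_per_second(64)
UPLOAD_LIMIT = 25 * 1024 * 1024  # assumed 25 MB upload limit

if __name__ == "__main__":
    print(f"WAV: {max_minutes(UPLOAD_LIMIT, WAV_RATE):.1f} min per upload")
    print(f"M4A: {max_minutes(UPLOAD_LIMIT, M4A_RATE):.1f} min per upload")
```

Under these assumptions the M4A file is a quarter of the size, so four times as much audio fits into a single request and the upload is correspondingly faster.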

Also something completely unrelated:

Tips and Tricks for STT-GPT-RAG-TTS or How to Get Faster Responses and Lower Costs:

Do what developers always do! Caching!

We save the generated response as an MP3 and store the file path in the file storage, embedding it in the graph of the RAG system. Nodes are then added to the VectorDB.
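A minimal sketch of the caching idea, assuming the cache key is a hash of the normalized response text. `synthesize` is a hypothetical stand-in for whatever TTS call you use, and the graph and VectorDB steps are left out; the returned path is what you would store on the node.

```python
import hashlib
from pathlib import Path
from typing import Callable


def cache_key(text: str) -> str:
    """Stable key: SHA-256 of the case- and whitespace-normalized text."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def get_or_create_mp3(
    text: str,
    cache_dir: Path,
    synthesize: Callable[[str], bytes],  # hypothetical TTS function
) -> Path:
    """Return the cached MP3 path for `text`; call `synthesize` only on a miss."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    path = cache_dir / f"{cache_key(text)}.mp3"
    if not path.exists():
        path.write_bytes(synthesize(text))
    return path
```

On a cache hit no TTS request is made at all, which is where both the latency and the cost savings come from.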

Prediction!

We already know (prediction 1) what the conversation partner will say next (prediction 2).

Prediction 2 could already foresee that the sentence will end with “say next” and start generating the response before the sentence is even fully spoken.
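As a toy illustration of starting early, here is a sketch that matches the partial transcript against a list of known utterances. This prefix lookup is my own simplification; a real system would more likely use a language model for the prediction, and `known_utterances` and `min_words` are assumed names.

```python
from typing import Optional


def predict_completion(
    partial: str,
    known_utterances: list[str],
    min_words: int = 3,
) -> Optional[str]:
    """Return the only known utterance that starts with `partial`, else None."""
    words = partial.lower().split()
    if len(words) < min_words:
        return None  # too little context to commit to a prediction
    prefix = " ".join(words)
    matches = [
        u for u in known_utterances
        if " ".join(u.lower().split()).startswith(prefix)
    ]
    # Only predict when the continuation is unambiguous.
    return matches[0] if len(matches) == 1 else None


KNOWN = [
    "what are your opening hours",
    "what are your prices",
    "where is the nearest store",
]
```

As soon as `predict_completion` returns something other than `None`, response generation can begin while the speaker is still finishing the sentence; if the final transcript turns out different, the early result is simply discarded.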

Filler Phrases / Sentence Starters / Scenes

Instead of waiting for the response to be delivered, you can play MP3 files simulating scenes. For example, the voice bot could “drop” something: “clatter - uhm, oops - err, where is it - ah, here, sorry I dropped something - where were we, oh right - mention the node and then give the response.”

Or use filler phrases (pre-recorded MP3 files with slight delays or played immediately): “Sure, I can tell you something about that,” or “Interesting point,” or “uhh” - then play the actual response.

Another approach is pre-caching sentence starters.
Give GPT a sentence starter we’ve cached and say, “Complete the answer for [topic], begin output after: [general sentence starter].”
This way, the starter can already play while the rest of the sentence is being generated.
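The sentence-starter trick could look roughly like this. The prompt wording is taken from the post above; the clip paths and function names are made up for illustration.

```python
# Hypothetical mapping of cached sentence starters to pre-recorded clips.
STARTER_CLIPS = {
    "Sure, I can tell you something about that.": "starters/sure.mp3",
    "Interesting point.": "starters/interesting.mp3",
}


def build_prompt(topic: str, starter: str) -> str:
    """Ask the model to continue after a starter that is already playing."""
    return f"Complete the answer for {topic}, begin output after: {starter}"


def full_answer(starter: str, continuation: str) -> str:
    """Join the cached starter and the generated continuation for logging."""
    return f"{starter.rstrip()} {continuation.lstrip()}"
```

Only the continuation has to go through TTS, because the audio for the starter already exists and is playing while the model generates the rest.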

Or try a hybrid approach:
If the output takes longer than expected, fill in with “uhhh”, check if the answer is ready, and if not, play a scene.
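The hybrid approach can be sketched with `asyncio`. `generate` and `play` are hypothetical callables for your model call and audio player, and the clip names and timeout are arbitrary assumptions.

```python
import asyncio
from typing import Awaitable, Callable

# Escalation order: short filler first, then a longer scene.
FILLER_CLIPS = ("uhhh.mp3", "dropped_something_scene.mp3")


async def respond_with_fillers(
    generate: Callable[[], Awaitable[str]],     # hypothetical model call
    play: Callable[[str], Awaitable[None]],     # hypothetical audio player
    patience_s: float = 0.5,
) -> str:
    """Wait for the answer, covering each timeout with the next filler clip."""
    task = asyncio.ensure_future(generate())
    for clip in FILLER_CLIPS:
        done, _ = await asyncio.wait({task}, timeout=patience_s)
        if done:
            break
        await play(clip)  # answer not ready yet, so fill the silence
    return await task
```

If the answer arrives within the first `patience_s` seconds, nothing is played; after one timeout the caller hears "uhhh", and after a second one the scene, before the actual response is returned.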
