Impact of WAV vs M4A on Whisper Transcription Quality

Hi,

I was wondering if there’s a difference in the quality of transcription between using WAV or M4A for audio recordings.

Will Whisper produce better text output when using a WAV file?

Thank you.

WAV files are typically uncompressed, but the difference should be minor, and the speed gained from the smaller M4A file should easily make up for it.

It really comes down to the quality of the recording.

Most of the times the model doe n rely ned to knw th complet transcr. and it will still understand :wink:

Also something completely unrelated:

Tips and Tricks for STT-GPT-RAG-TTS or How to Get Faster Responses and Lower Costs:

  1. Do what developers always do! Caching!

    We save the generated response as an MP3 in file storage and embed its file path in the graph of the RAG system. The nodes are then added to the VectorDB.

  2. Prediction!

    We
    already
    know — prediction 1
    what
    the
    conversation
    partner
    will
    say next — prediction 2.

    Prediction 2 could already foresee that the sentence will end with “say next” and start generating the response before the sentence is even fully spoken.

  3. Filler Phrases / Sentence Starters / Scenes

    Instead of waiting for the response to be delivered, you can play MP3 files simulating scenes. For example, the voice bot could “drop” something: “clatter - uhm, oops - err, where is it - ah, here, sorry I dropped something - where were we, oh right - mention the node and then give the response.”

    Or use filler phrases (pre-recorded MP3 files with slight delays or played immediately): “Sure, I can tell you something about that,” or “Interesting point,” or “uhh” - then play the actual response.

    Another approach is pre-caching sentence starters.
    Give GPT a sentence starter we’ve cached and say, “Complete the answer for [topic], begin output after: [general sentence starter].”
    This way, the starter can already play while the rest of the sentence is being generated.

    Or try a hybrid approach:
    If the output takes longer than expected, fill in with “uhhh”, check if the answer is ready, and if not, play a scene.
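
To make tips 1 and 3 a bit more concrete, here is a minimal sketch of how the cache lookup and the hybrid filler approach could fit together. Everything named here is an assumption for illustration: `respond`, `cache_key`, the `generate` and `play` callables, and the clip file names are hypothetical, and a plain dict keyed by a hash of the question stands in for the graph and VectorDB lookup described above (a real setup would likely match semantically rather than exactly).

```python
import asyncio
import hashlib


def cache_key(text: str) -> str:
    """Hash the normalized question so trivially different spellings share a key."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


async def respond(question, cache, generate, play, expected_seconds=0.7):
    """Play a cached MP3 if one exists, otherwise cover the wait with fillers.

    cache:    dict mapping cache_key -> MP3 path (stand-in for graph/VectorDB)
    generate: hypothetical async callable, question -> path of the generated MP3
    play:     hypothetical async callable that plays one MP3 path
    """
    key = cache_key(question)
    if key in cache:
        # Tip 1: answer was generated before, so no STT/GPT/TTS round trip.
        await play(cache[key])
        return cache[key]

    # Start generating in the background and wait only as long as "expected".
    task = asyncio.ensure_future(generate(question))
    done, _ = await asyncio.wait({task}, timeout=expected_seconds)

    if not done:
        # Tip 3, hybrid: short filler first ...
        await play("filler_uhh.mp3")
        # ... then check again, and play a longer scene if still not ready.
        if not task.done():
            await play("scene_dropped_something.mp3")

    answer_path = await task
    cache[key] = answer_path
    await play(answer_path)
    return answer_path
```

With this shape, a fast or cached answer plays immediately, and the fillers only show up when generation really is slow. The 0.7 second default is just a placeholder and would need tuning against real latencies.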