Whisper API transcription completely wrong for mp4

Hi, I am working on a web app. The frontend is in React and the backend is in Express. My backend receives audio files from the frontend and then uses Whisper to transcribe them. For webm files (which come from Chrome browsers), everything works perfectly. However, for mp4 files (which come from Safari because it doesn’t support webm), the transcription is completely wrong, not even remotely related to the actual audio. It’s usually only a few words, like “Bye. Bye” or “Hello”, regardless of the length or content of the mp4 file. I’ve also verified that the audio the server receives is correct (it isn’t being sent improperly; I can play it back from the server and hear it fine), but the transcription is wildly inaccurate.

Please help

2 Likes

If you use a library to transcode from mp4 to some other format/codec (mp3 or wav) before you forward it to the Whisper model, does it work better?

Reminder: mp4 is a container, not an audio codec. It can contain multiple muxed streams, and obviously video also.

It would be similar to saying something can’t understand the contents of your zip file.

So one must look at the mp4 being generated: the muxer, interleaving, and ultimately, the audio codec, which could be anything from mp3, HE-AAC, to Apple lossless or Opus. Then clean THAT up to something whisper likes.
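For anyone wanting to check what is actually inside a Safari-produced mp4, here is a minimal Node sketch. It assumes ffprobe (part of ffmpeg) is installed and on PATH; the function names `probeAudioStreams` and `summarizeProbeOutput` are made up for this example. It only reports the audio codec, sample rate, channel count and duration, so comparing the reported duration with the real length of the recording is one quick sanity check.

```javascript
const { execFile } = require('node:child_process');

// Reduce ffprobe's JSON output to the fields that matter for debugging.
function summarizeProbeOutput(json) {
  const { streams = [] } = JSON.parse(json);
  return streams.map((s) => ({
    codec: s.codec_name,
    sampleRate: Number(s.sample_rate),
    channels: s.channels,
    // ffprobe omits duration when the container does not declare one.
    durationSeconds: s.duration === undefined ? null : Number(s.duration),
  }));
}

// Ask ffprobe (assumed to be on PATH) for the audio streams in a file.
function probeAudioStreams(filePath, callback) {
  const args = [
    '-v', 'error',           // print only real errors
    '-select_streams', 'a',  // audio streams only
    '-show_entries', 'stream=codec_name,sample_rate,channels,duration',
    '-of', 'json',
    filePath,
  ];
  execFile('ffprobe', args, (err, stdout) => {
    if (err) return callback(err);
    try {
      callback(null, summarizeProbeOutput(stdout));
    } catch (parseError) {
      callback(parseError);
    }
  });
}
```

Usage would be something like `probeAudioStreams('upload.mp4', (err, streams) => console.log(err || streams))`, where `upload.mp4` is a placeholder path.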

1 Like

It definitely seems like the Whisper API is unable to understand mp4 file types encoded with mp4a. I even tried transcoding the file into a webm/opus but it always just returns really short transcriptions that are seemingly unrelated.

“mp4a” is not an audio codec. That’s just Apple making different file names for the same .mp4, because their metadata environment of devices isn’t smart enough to figure out what’s in the file.

So, if you use a library to transcode from mp4 to mp3 or wav before you forward it to the Whisper model, does it work better?

1 Like

I was using fluent-ffmpeg to transcode to wav, but it didn’t seem to work either. I might not be transcoding the data correctly though.

I ended up recording the audio using mic-recorder-to-mp3 on the client side instead and it works perfectly.

https://chat.openai.com/share/0bbc444d-192b-42d1-b432-e6be59ffdf09

Maybe that works?

Are there intentions to make .mp4 work with whisper? Or maybe remove the .mp4 support from the API Reference.

Using recorded mp4 audio from iPhone Safari (which is the only format supported as of now) does not give anything other than hallucinations. It works when converted into mp3 manually. Conversion on the production server would be a real pain, and using an external conversion API would be a waste and a risk.

As has been mentioned before, MP4 is a container format, like a zip file. Your mp4 may not be adhering to the standards expected by the decoding side of things, which is not to say that all mp4s do not work. Does Apple do something proprietary or novel with the mp4 standard? Some DRM thing they build in, perhaps?

Why is converting on production a real pain? Create a message, send it into a queue, let a microservice with ffmpeg create the mp3, create a message, send it to a queue and let whisper do its thing…
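The ffmpeg step of such a worker could look roughly like the sketch below. It assumes ffmpeg is installed on the worker and on PATH; `buildMp3Args` and `transcodeToMp3` are hypothetical names, the queue plumbing is left out, and the mono / 16 kHz / 64 kbit/s settings are just one reasonable choice for speech, not something Whisper requires.

```javascript
const { execFile } = require('node:child_process');

// Build the ffmpeg argument list for an audio-only mp3 transcode.
function buildMp3Args(inputPath, outputPath) {
  return [
    '-y',            // overwrite the output file if it already exists
    '-i', inputPath, // input container (mp4, webm, ...)
    '-vn',           // drop any video stream
    '-ac', '1',      // downmix to mono
    '-ar', '16000',  // resample to 16 kHz
    '-b:a', '64k',   // audio bitrate
    outputPath,      // the .mp3 extension selects the mp3 muxer
  ];
}

// Run ffmpeg and resolve with the output path once it has finished.
function transcodeToMp3(inputPath, outputPath) {
  return new Promise((resolve, reject) => {
    execFile('ffmpeg', buildMp3Args(inputPath, outputPath), (err, _stdout, stderr) => {
      if (err) reject(new Error(`ffmpeg failed: ${stderr || err.message}`));
      else resolve(outputPath);
    });
  });
}
```

A worker would call `await transcodeToMp3('in.mp4', 'out.mp3')` (placeholder paths) and then hand the mp3 to the transcription step.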

1 Like

Credit goes to @keizo, the solution for Safari mp4 files is: mediaRecorder.start(1000). This is of course after mediaDevices.getUserMedia etc.

Containers vs codecs has nothing to do with this as far as I can tell. And this doesn’t require running a microservice. It seems to me that Safari has some odd behavior around dataavailable events. macOS 13.3.1, Safari 16.4.

Edit: A couple more notes in case it helps someone. I had Safari download the mp4 files it produced before this fix and they were perfectly playable in IINA (media player), but they didn’t work well on the Whisper API. As for the dataavailable events (mediaRecorder.addEventListener('dataavailable', ...)): Safari was giving these to me weirdly, some with zero bytes, some a long time after MediaRecorder.stop(), some depending on when/whether I stop/remove the MediaStream track. This is what the fix fixed. (Safari also takes a long time to start ‘listening’ with a fresh getUserMedia().)
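For reference, a minimal sketch of how the start(1000) fix might be wired up. This is an illustration under assumptions, not the original poster's code: `recordInSlices` and `upload` are hypothetical names, and it takes the recorder as a parameter so the logic can be exercised outside a browser. It collects the periodic chunks, skips the zero-byte ones mentioned above, and builds a single Blob when the stop event fires.

```javascript
// `recorder` is a MediaRecorder, or any object with the same
// start/stop/addEventListener surface.
function recordInSlices(recorder, onComplete, timesliceMs = 1000) {
  const chunks = [];

  recorder.addEventListener('dataavailable', (event) => {
    // Ignore empty chunks so they do not end up in the final file.
    if (event.data && event.data.size > 0) chunks.push(event.data);
  });

  recorder.addEventListener('stop', () => {
    // Per the MediaRecorder spec, the last dataavailable fires before stop.
    onComplete(new Blob(chunks, { type: recorder.mimeType }));
  });

  // Passing a timeslice asks the recorder to emit a chunk about every
  // timesliceMs milliseconds instead of one chunk at the end.
  recorder.start(timesliceMs);

  // Return a function the caller can use to end the recording.
  return () => recorder.stop();
}

// Browser usage (upload is a placeholder for your own request code):
//   const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
//   const stop = recordInSlices(new MediaRecorder(stream), (blob) => upload(blob));
//   // later, e.g. from a button handler: stop();
```

Whether 1000 ms is the best interval is not something this thread establishes; it is simply the value reported to work.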

10 Likes

It worked! Thanks a lot, do you have the link to the original post/solution for this?

1 Like

@devicz I can’t post links, but you can use urls /t/{id} for ids 322252 and 93420.

1 Like