Hi, I have a web app in Nuxt 3, and the backend is in FastAPI.
I tried recording audio in every browser and sending the blob from Nuxt to the FastAPI endpoint, which takes in the blob, creates a temp file, and feeds it to the Whisper API. Interestingly, it works in every browser except Safari on iPhones.
Every time I make a call from the Safari browser on iPhone, I get this error
On the front end I am using MediaRecorder to take in the media stream → convert it to a blob once recording has stopped → send it to the FastAPI endpoint.
The mime type I am setting is audio/wav.
The same thing done from other browsers on other devices works perfectly fine. Do I need to do things differently? Am I missing something?
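One thing that may be worth checking on the FastAPI side: labelling the blob as audio/wav does not convert the recording, so the bytes that arrive can still be MP4 (Safari) or WebM (Chrome). Below is a minimal sketch that guesses the container from the first bytes, so the temp file can get a matching extension. The function name sniff_audio_container is hypothetical, and the checks cover only a few common signatures.

```python
def sniff_audio_container(data: bytes) -> str:
    """Guess a file extension from the leading bytes of an upload.

    Returns "unknown" when no known signature matches.
    """
    # WAV: "RIFF" <4-byte size> "WAVE"
    if len(data) >= 12 and data[:4] == b"RIFF" and data[8:12] == b"WAVE":
        return "wav"
    # MP4/M4A: first box is usually "ftyp", with the type at bytes 4..8
    if len(data) >= 8 and data[4:8] == b"ftyp":
        return "mp4"
    # EBML header, shared by WebM and Matroska (not distinguished here)
    if data[:4] == b"\x1a\x45\xdf\xa3":
        return "webm"
    # Ogg page marker
    if data[:4] == b"OggS":
        return "ogg"
    return "unknown"
```

If this returns "mp4" for a blob the front end labelled audio/wav, the upload is Safari's MP4 under a WAV name, which would at least explain why the same code path behaves differently per browser.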
Nothing remarkable there except the duplicated moov atom (Safari bug?), so I fired up a hex editor and removed the extra moov atom. New file doesn’t have the warning, plays fine in any player, and still gives me the same error.
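For anyone who wants to check for the duplicated moov atom without a hex editor, here is a rough sketch that walks the top-level boxes of an MP4 buffer. It assumes the standard box layout (4-byte big-endian size, 4-byte type, with size 1 meaning a 64-bit size follows and size 0 meaning "to end of file"). The function names are hypothetical, and it does not descend into nested boxes.

```python
import struct


def list_top_level_atoms(data: bytes) -> list[tuple[str, int, int]]:
    """Return (type, offset, size) for each top-level box in an MP4 buffer."""
    atoms = []
    offset = 0
    while offset + 8 <= len(data):
        size, kind = struct.unpack(">I4s", data[offset:offset + 8])
        header = 8
        if size == 1:
            # A 64-bit size follows the type field.
            if offset + 16 > len(data):
                break
            size = struct.unpack(">Q", data[offset + 8:offset + 16])[0]
            header = 16
        elif size == 0:
            # The box extends to the end of the buffer.
            size = len(data) - offset
        if size < header:
            break  # malformed size; stop rather than loop forever
        atoms.append((kind.decode("latin-1"), offset, size))
        offset += size
    return atoms


def count_top_level(data: bytes, kind: str) -> int:
    """Count top-level boxes of the given four-character type."""
    return sum(1 for name, _, _ in list_top_level_atoms(data) if name == kind)
```

Calling count_top_level(buffer, "moov") on a Safari recording should show whether more than one moov box is present. Repeated moof and mdat boxes are normal in fragmented files, so only the moov count is of interest here.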
Copying the stream to a new file with ffmpeg -i buffer.mp4 -c copy test.mp4 gives a file that works with the transcription API just fine, which leads me to conclude that something minor about Safari’s container packaging is tripping up the Whisper API, but… why? Whatever it is, it doesn’t seem to be invalid.
I’d really rather not have to run my recordings through ffmpeg before submission.
Apologies if this is an unhelpful comment, audio is not my domain - but is it related to the codec? When I record through Chrome, I get codec = opus. When I record through Safari, I get codec = aac. Chrome works, Safari does not.
I thought that might be the case, but AAC generated by anything other than Safari also works. In fact, the stream copy experiment only changes the metadata in the file, not the encoded audio, and that works too.
I also thought it was the codec, but using Chrome on iPhone doesn’t do the trick either. However, the app works great in Chrome on any device other than an iPhone.
This pipes the audio coming from Safari into ffmpeg, and pipes the output of ffmpeg back into a buffer, without touching disk and without transcoding. This is the fastest way I can think of.
The issue with piping is that ffmpeg has to write the output in one pass. It can’t write most of the file and then go back to the header to update it, so the output can’t have a moov atom. Other, more natural formats than -f ipod work too if you drop the moov atom, but there seems to be a huge performance penalty: the API takes up to 30% more time to process them.
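For a Python backend, the piping approach described above might look roughly like the sketch below. This is not the poster's original code. The -movflags frag_keyframe+empty_moov value in particular is an assumption on my part, added because, as far as I know, ffmpeg's MP4-family muxers refuse to write to a non-seekable pipe unless the output is fragmented. The function names are hypothetical, and whether the Whisper API accepts the result is untested here.

```python
import subprocess


def remux_command(output_format: str = "ipod") -> list[str]:
    """Build an ffmpeg command that stream-copies stdin to stdout."""
    return [
        "ffmpeg", "-hide_banner", "-loglevel", "error",
        "-i", "pipe:0",          # read the recording from stdin
        "-c", "copy",            # copy the audio stream, no transcoding
        # Assumption: fragmented output so nothing has to be rewritten later.
        "-movflags", "frag_keyframe+empty_moov",
        "-f", output_format,     # -f ipod, as mentioned in the thread
        "pipe:1",                # write the result to stdout
    ]


def remux_in_memory(audio: bytes) -> bytes:
    """Pipe `audio` through ffmpeg and return the remuxed bytes.

    Raises subprocess.CalledProcessError if ffmpeg exits with an error,
    and FileNotFoundError if ffmpeg is not installed.
    """
    result = subprocess.run(
        remux_command(),
        input=audio,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,
    )
    return result.stdout
```

In a FastAPI handler this would sit between reading the upload and calling the transcription API, e.g. fixed = remux_in_memory(await file.read()). Note that reading MP4 from a pipe only works when the moov box comes before the audio data.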
I can’t figure out how to get the Whisper API to accept the mp4 produced by Safari using the HTML5 MediaRecorder API
I am trying to use the MediaRecorder HTML5 API to record audio from the user’s microphone and then send it to Whisper. The mp4 file that Safari produces is rejected by the Whisper API. If I convert this file to mp3, it works fine, but I need to avoid this step.
Thanks all for the comments. I tried all the possible ways but still, it doesn’t work. Tried mp3, wav, mp4 formats, but no luck. Personally, I feel it is an API issue because the audio is recorded and played but when it is sent to Whisper API it doesn’t recognise it.
The workaround I am currently using, until OpenAI fixes their API endpoint, is to load the MediaRecorder polyfill for Safari only.
Even though Safari now fully implements the MediaRecorder API, it is obviously producing MP4 files that OpenAI does not like. By using the polyfill, Safari instead produces WAV files that OpenAI happily accepts.
Of course the ideal solution is for OpenAI to fix their API, but for now this works. The downsides are that you have to load the polyfill (it’s quite small though) and that the resulting WAV files are much larger than MP4, WebM, etc.
I’ve been fighting with this problem, and I think there are some versions of ffmpeg that don’t work well with the AAC created by Safari.
Whatever version is on OpenAI’s server might be the root problem.
I suspect this because when I compress files from Safari before sending them to Whisper, it works beautifully on our dev server but not in production. The only major difference I could find was that they have different versions of ffmpeg installed.
I know Whisper uses ffmpeg because I had it running locally for a while, and it’s the most common way to unpack these audio files.
I did some more controlled testing with ffmpeg versions, and I just wanted to confirm that older versions cannot handle the m4a created by the Web Audio API.
Of course, it could be something different altogether on OpenAI’s end, but if you’re trying to capture audio from the browser, this problem will likely keep coming up.
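To compare dev and production quickly, a small sketch like this can report which ffmpeg each server has installed. It assumes the usual "ffmpeg version ..." banner printed by ffmpeg -version. The function names are hypothetical, and it makes no claim about which versions handle Safari's files, since the thread does not pin that down.

```python
import re
import subprocess
from typing import Optional


def parse_ffmpeg_version(banner: str) -> Optional[str]:
    """Extract the version token from the output of `ffmpeg -version`."""
    match = re.search(r"ffmpeg version (\S+)", banner)
    return match.group(1) if match else None


def installed_ffmpeg_version(binary: str = "ffmpeg") -> Optional[str]:
    """Return the installed ffmpeg version string, or None if unavailable."""
    try:
        result = subprocess.run(
            [binary, "-version"],
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError):
        # ffmpeg is missing, not executable, or exited with an error.
        return None
    return parse_ffmpeg_version(result.stdout)
```

Logging installed_ffmpeg_version() at startup on both servers makes a version mismatch like the one described above visible without logging into each machine.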
Is anyone else experiencing the issue on Firefox? I have recordings working on Chrome, but not on Safari or Firefox - unclear if that is one issue or two.
I got another note on this from the server team. Apparently the Ubuntu LTS release (that 1/3 of the internet runs on) comes packaged with the older version of ffmpeg that doesn’t work with that codec. So, don’t be surprised when you struggle to record audio from iOS for the next year.