Whisper API cannot read Safari-recorded audio files correctly

Hi, I have a web app in Nuxt 3 and the backend is in FastAPI.
I have tried recording and sending the audio blob from Nuxt to the FastAPI endpoint in every browser; the endpoint takes in the blob, creates a temp file, and feeds it to the Whisper API. Interestingly, it works in every browser except Safari on iPhone.
Every time I make a call from the Safari browser on iPhone, I get this error:

openai.error.InvalidRequestError: Invalid file format. Supported formats: ['m4a', 'mp3', 'webm', 'mp4', 'mpga', 'wav', 'mpeg']

On the front end I am using MediaRecorder to take in the media stream, convert it to a blob when recording stops, and send it to the FastAPI endpoint.
The MIME type I am setting is audio/wav.

The same flow works perfectly fine from other browsers on other devices. Do I need to do things differently? Am I missing something?
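
For reference, here is roughly what the front-end code does. This is a simplified sketch, not the actual component; the endpoint path and the timing are placeholders.

// Simplified sketch of the recording flow: record with MediaRecorder and
// POST the resulting blob to the FastAPI backend.
async function recordAndSend() {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const recorder = new MediaRecorder(stream);
  const chunks = [];

  recorder.ondataavailable = (e) => chunks.push(e.data);
  recorder.onstop = async () => {
    // Safari records AAC in an MP4 container regardless of the type we
    // stamp on the Blob here.
    const blob = new Blob(chunks, { type: recorder.mimeType || 'audio/wav' });
    const form = new FormData();
    form.append('file', blob, 'recording.wav');
    await fetch('/api/transcribe', { method: 'POST', body: form }); // placeholder endpoint
  };

  recorder.start();
  setTimeout(() => recorder.stop(), 5000); // stop after 5 s in this sketch
}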

5 Likes

Same here. I thought maybe Safari on iPhone was declaring the wrong format, so I saved the blob and ran it through ffmpeg. Here’s what I got:

[mov,mp4,m4a,3gp,3g2,mj2 @ 00000230b38e01c0] Found duplicated MOOV Atom. Skipped it
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from 'filename':
  Metadata:
    creation_time   : 2023-03-12T01:37:10.000000Z
    major_brand     : iso5
    minor_version   : 1
    compatible_brands: isomiso5hlsf
  Duration: 00:00:06.22, start: 0.000000, bitrate: 211 kb/s
  Stream #0:0[0x1](und): Audio: aac (LC) (mp4a / 0x6134706D), 44100 Hz, mono, fltp, 207 kb/s (default)
    Metadata:
      creation_time   : 2023-03-12T01:37:10.000000Z
      handler_name    : Core Media Audio
      vendor_id       : [0][0][0][0]

Nothing remarkable there except the duplicated moov atom (a Safari bug?), so I fired up a hex editor and removed the extra moov atom. The new file doesn’t have the warning and plays fine in any player, but it still gives me the same error.
Copying the stream to a new file with ffmpeg -i buffer.mp4 -c copy test.mp4 gives a file that works with the transcription API just fine, which leads me to conclude that something minor about Safari’s container packaging is tripping up the Whisper API. But… why? Whatever it is, it doesn’t seem to be invalid.

I’d really rather not have to run my recordings through ffmpeg before submission :confused:

2 Likes

Apologies if this is an unhelpful comment, audio is not my domain, but is it related to the codec? When I record through Chrome, I get codec = opus. When I record through Safari, I get codec = aac. Chrome works, Safari does not.

I thought that might be the case, but AAC generated by anything other than Safari also works. In fact, the stream-copy experiment only changes the metadata in the file, not the encoded audio, and that works too.

I thought it was the codec too, but using Chrome on iPhone doesn’t do the trick either (Chrome on iOS runs on the same WebKit engine). However, the app works great in Chrome on any device apart from iPhone.

Does anyone know where I can report this as a bug to OpenAI?

In the meantime, here’s a workaround using ffmpeg:

const { spawn } = require('child_process');
const { Readable } = require('stream');
const { Buffer } = require('buffer');

// Remux (no re-encode) an in-memory audio buffer through ffmpeg, entirely
// via pipes, and return the rewritten buffer.
async function pipecopy(inputBuffer) {
  const ffmpeg = spawn('ffmpeg', [
    '-hide_banner',
    '-i', 'pipe:0',              // read input from stdin
    '-codec', 'copy',            // copy the stream, don't transcode
    '-movflags', 'empty_moov',   // allow writing to a non-seekable output
    '-f', 'ipod',                // force the output container
    'pipe:1'                     // write the result to stdout
  ]);

  // Collect ffmpeg's stdout into a readable stream we control.
  const stream = new Readable();
  stream._read = () => {};

  ffmpeg.stdout.on('data', (data) => {
    stream.push(data);
  });

  ffmpeg.on('exit', (code, signal) => {
    console.log(`ffmpeg process exited with code ${code} and signal ${signal}`);
    if (code !== 0) {
      stream.destroy(new Error(`ffmpeg exited with code ${code}`));
      return;
    }
    stream.push(null);
  });

  // Feed the whole input buffer to ffmpeg's stdin and close it.
  ffmpeg.stdin.write(inputBuffer);
  ffmpeg.stdin.end();

  // Reassemble the piped output into a single Buffer.
  const buffer = await new Promise((resolve, reject) => {
    const chunks = [];
    stream.on('data', (chunk) => {
      chunks.push(chunk);
    });
    stream.on('end', () => {
      resolve(Buffer.concat(chunks));
    });
    stream.on('error', (err) => {
      reject(err);
    });
  });

  return buffer;
}

module.exports = pipecopy;

This pipes the audio coming from Safari into ffmpeg and pipes the output of ffmpeg back into a buffer, without touching disk and without transcoding. It’s the fastest way I can think of.

The catch with piping is that ffmpeg has to write the output in one pass: it can’t write most of the file and then seek back to update the header, so it can’t produce a normal moov atom (hence -movflags empty_moov). Other, more natural formats than -f ipod also work if you drop the moov atom, but there seems to be a huge performance penalty: the API takes up to 30% more time to process them.
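
For anyone wiring this up, usage looks roughly like the sketch below. The file names and the read/write steps are illustrative; in practice the buffer comes straight from the upload and the result goes straight into the request to the transcription API.

// Hypothetical usage of pipecopy(): remux the raw Safari upload, then hand
// the result to whatever forwards it to the transcription API.
const fs = require('fs');
const pipecopy = require('./pipecopy');

async function fixSafariUpload(path) {
  const raw = fs.readFileSync(path);         // e.g. the saved request body
  const remuxed = await pipecopy(raw);       // same audio, rewritten container
  fs.writeFileSync('remuxed.m4a', remuxed);  // or attach it to a multipart request
}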

1 Like

Same issue here, with output recorded directly in Logic Pro.

1 Like

I have the same issue.
It fails only on Safari.

The whole component is in a gist if somebody would like to take a look.

WIP :slight_smile: Testing how it can work in React and Next.

Same error on my side. It works on Chrome but not with Safari-recorded audio. Any tips? ChatGPT-4 didn’t help, hehe.

This is affecting me as well.

I can’t figure out how to get the Whisper API to accept the MP4 produced by Safari using the HTML5 MediaRecorder API.

I am trying to use the HTML5 MediaRecorder API to record audio from the user’s microphone and then send it to Whisper. The MP4 file that Safari produces is rejected by the Whisper API. If I convert the file to MP3 it works fine, but I need to avoid that step.

Thanks all for the comments. I tried every possible way, but it still doesn’t work: mp3, wav, and mp4 formats, all with no luck. Personally, I feel it is an API issue, because the audio records and plays back fine, yet when it is sent to the Whisper API it isn’t recognised.

1 Like

I’m also facing the same issue when using the MediaRecorder in Safari with MP4s.

The workaround I am currently using, until OpenAI fixes their API endpoint, is to load a MediaRecorder polyfill for Safari only:

Even though Safari now fully implements the MediaRecorder API, it is obviously producing MP4 files that OpenAI does not like. With the polyfill, Safari instead produces WAV files that OpenAI happily accepts.

Of course the ideal solution is for OpenAI to fix their API, but for now this works. The downsides are that you have to load the polyfill (it’s quite small though) and the resulting WAV files are much larger than MP4/WebM/etc.
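
Conditionally loading it looks something like the sketch below. The package name here is one common MediaRecorder polyfill, used as an example rather than a specific recommendation.

// Sketch: swap in a WAV-producing MediaRecorder polyfill on Safari only.
// 'audio-recorder-polyfill' is one common option; adjust to whichever you use.
const isSafari = /^((?!chrome|android).)*safari/i.test(navigator.userAgent);

async function getRecorder(stream) {
  if (isSafari) {
    const { default: AudioRecorder } = await import('audio-recorder-polyfill');
    window.MediaRecorder = AudioRecorder; // the polyfill records WAV instead of MP4
  }
  return new MediaRecorder(stream);
}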

2 Likes

I’ve been fighting with this problem, and I think there are some versions of ffmpeg that don’t work well with the AAC created by Safari.
Whatever version is on OpenAI’s servers might be the root problem.

I suspect this because when I compress files from Safari before sending them to Whisper, it works beautifully on our dev server but not in production. The only major difference I could find is that they have different versions of ffmpeg installed.

I know Whisper uses ffmpeg because I had it running locally for a while, and it’s the most common way to unpack these audio files.
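
If you want to compare for yourself, a quick check on both machines (assuming ffmpeg is on the PATH) prints the version banner:

// Print the first line of `ffmpeg -version` so dev and production can be compared.
const { execSync } = require('child_process');
console.log(execSync('ffmpeg -version').toString().split('\n')[0]);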

We’re seeing the same issue.

A fix from OpenAI would be ideal. The polyfill looks like the second-best solution.

I’m seeing the same issue. Hoping it’s fixed on the API side. Thank you.

I did some more controlled testing with ffmpeg versions, and I just wanted to confirm that older versions cannot handle the m4a created by the browser recording API.
Of course, it could be something different altogether on OpenAI’s end, but if you’re trying to capture audio from the browser, this problem will likely keep coming up.

1 Like

Is anyone else experiencing the issue on Firefox? I have recordings working on Chrome, but not Safari or Firefox. Unclear if that is one issue or two.

Yes, experiencing issues on Firefox with webm as the MIME type. Going to iterate through the others…

I got another note on this from the server team. Apparently the Ubuntu LTS release (which a third of the internet runs on) comes packaged with the older version of ffmpeg that doesn’t work with that codec. So don’t be surprised if you struggle to record audio from iOS for the next year.

Here’s a good resource for working with the client end of the problem, but you will still struggle to send the audio to OpenAI: