Whisper API, increase file limit >25 MB

I’m currently using the Whisper API for audio transcription, and the default 25 MB file size limit poses challenges, particularly in maintaining sentence continuity when splitting files.

By default, the Whisper API only supports files that are less than 25 MB. If you have an audio file that is larger than that, you will need to break it up into chunks of 25 MB or less or use a compressed audio format. To get the best performance, we suggest that you avoid breaking the audio up mid-sentence as this may cause some context to be lost.

Splitting a long file accurately means cutting at sentence boundaries, but an external library can’t find those boundaries without a transcription in the first place.

As the primary purpose of the service is transcription, I’m seeking information on increasing the file size limit to avoid disrupting the natural flow of sentences.

Could you provide guidance on this or share any plans for future updates addressing this limitation?

Is it possible simply to increase the limit?

Your assistance is greatly appreciated.

It is possible to stretch the limit to hours of audio by re-encoding it.

Since the primary purpose of the service is transcription, you can use a voice codec and a low bitrate.

For example, here is a command that does exactly what you want.

ffmpeg -i audio.mp3 -vn -map_metadata -1 -ac 1 -c:a libopus -b:a 12k -application voip audio.ogg

Opus is one of the highest-quality audio encoders at low bitrates, and Whisper supports it in an ogg container.
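
To put a rough number on "hours": the sketch below estimates how much audio fits under the 25 MB limit at a given bitrate. It assumes a constant bitrate, counts 1 MB as 1024 * 1024 bytes, and ignores container overhead, so treat the result as an approximation. The function name is just illustrative.

```python
def max_duration_hours(limit_mb: float = 25.0, bitrate_kbps: float = 12.0) -> float:
    """Approximate hours of audio that fit under a size limit at a given bitrate."""
    # Assumption: 1 MB = 1024 * 1024 bytes, and no container overhead.
    limit_bits = limit_mb * 1024 * 1024 * 8
    seconds = limit_bits / (bitrate_kbps * 1000)
    return seconds / 3600


print(round(max_duration_hours(), 1))            # 12 kbps Opus: about 4.9 hours
print(round(max_duration_hours(25.0, 128.0), 2)) # 128 kbps mp3 for comparison
```

Since -b:a is a target rather than a hard cap, the real file can come out slightly larger or smaller than this estimate.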

Silence detection: also a useful tool.
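
For anyone who wants to try silence detection from Python: ffmpeg has a silencedetect audio filter that logs silence_start and silence_end timestamps to stderr. The sketch below parses that log into (start, end) pairs that could serve as candidate split points. The -30dB threshold and 0.5 second minimum are arbitrary assumptions to tune for your recordings, and the function names are made up for this example.

```python
import re
import subprocess

SILENCE_START = re.compile(r"silence_start:\s*(-?\d+(?:\.\d+)?)")
SILENCE_END = re.compile(r"silence_end:\s*(-?\d+(?:\.\d+)?)")


def parse_silences(log_text):
    """Return (start, end) pairs in seconds found in silencedetect log output."""
    starts = [float(value) for value in SILENCE_START.findall(log_text)]
    ends = [float(value) for value in SILENCE_END.findall(log_text)]
    # zip drops a trailing start that never got a matching end.
    return list(zip(starts, ends))


def detect_silences(path, noise="-30dB", min_seconds=0.5):
    """Run ffmpeg's silencedetect filter on a file and parse its log."""
    cmd = [
        "ffmpeg", "-hide_banner", "-i", path,
        "-af", f"silencedetect=noise={noise}:d={min_seconds}",
        "-f", "null", "-",  # decode only, write no output file
    ]
    result = subprocess.run(cmd, capture_output=True, text=True)
    return parse_silences(result.stderr)
```

Calling detect_silences("audio.ogg") needs ffmpeg on the PATH; parse_silences can be tried on its own with pasted log text.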

This is amazing, man. Thank you. Solved my specific problem.

However, it would be nice to have a more general-purpose solution for audio of unlimited size in the future :slight_smile:

I have an idea. You could overlap the audio files and then post-process the transcriptions through gpt-4 to remove redundancy and create a complete whole.
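
A minimal sketch of the overlap part of that idea, just computing the time windows. The 10 minute chunk length and 15 second overlap are placeholder assumptions, and the function name is made up. It does not cover the harder step of merging the overlapping transcriptions.

```python
def overlapping_windows(total_seconds, chunk_seconds=600.0, overlap_seconds=15.0):
    """Return (start, end) windows covering the audio, each overlapping the previous one."""
    if overlap_seconds >= chunk_seconds:
        raise ValueError("overlap must be shorter than the chunk length")
    windows = []
    start = 0.0
    step = chunk_seconds - overlap_seconds
    while start < total_seconds:
        end = min(start + chunk_seconds, total_seconds)
        windows.append((start, end))
        if end >= total_seconds:
            break  # the last window reached the end of the audio
        start += step
    return windows


for start, end in overlapping_windows(1500):
    print(start, end)
```

Each window could then be cut out with ffmpeg's -ss (start) and -t (duration) options before uploading.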

That sounds like a good idea at first.

But building and maintaining a library to split audio files with overlap takes a lot of time and resources. Doing this for every programming language is an even bigger hassle and adds unnecessary complexity. Putting sentences back together after splitting is tricky: the flow is disrupted, and it’s not easy to maintain continuity. There is no guarantee that overlapping segments will match in the resulting outputs.

Umm, this was CLUTCH. I was able to turn a roughly 1.2 GB mp4 into a 9 MB audio file :exploding_head:

Crazy stuff!
Thanks a tonne!

I have never used ffmpeg before. I will use the statement above in a Python file which calls the OpenAI API. Two questions:

  1. Can I use an m4a file?
  2. I don’t understand the statement. Is it a system setting, or do I need to do something like myModifiedAudioFile = ffmpeg -i audio.mp3 -vn -map_metadata -1 -ac 1 -c:a libopus -b:a 12k -application voip audio.ogg? If so which is the input filename? I would really appreciate some direction on this.

  1. If you are using the command above, stick to ogg for the output, because .m4a implies a different compression scheme and therefore different parameters. If your application doesn’t permit this, use ogg for transcription and keep the .m4a for other analysis.
  2. It’s a command, just like ‘pip install…’ in Python. There are different ways to execute a command from a .py script. In the command above, the input file name is “audio.mp3” (the argument after -i) and “audio.ogg” is the output; you can change the names or add a directory as required.
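
One way to run that command from a .py script is the standard-library subprocess module. This is a sketch, not the only approach: it assumes ffmpeg is on your PATH (otherwise pass the full path to the exe as the ffmpeg argument), and the function names are made up for this example. The file after -i is the input, and the last argument is the output.

```python
import subprocess


def build_reencode_command(input_path, output_path, bitrate="12k", ffmpeg="ffmpeg"):
    """Build the argument list for the mono Opus re-encode shown earlier in the thread."""
    return [
        ffmpeg,
        "-i", input_path,        # input file
        "-vn",                   # drop any video stream
        "-map_metadata", "-1",   # drop metadata
        "-ac", "1",              # mix down to mono
        "-c:a", "libopus",
        "-b:a", bitrate,
        "-application", "voip",
        output_path,             # output file, container chosen from the extension
    ]


def reencode(input_path, output_path):
    """Run ffmpeg and raise CalledProcessError if it exits with an error."""
    subprocess.run(build_reencode_command(input_path, output_path), check=True)
```

After reencode("audio.mp3", "audio.ogg") returns, audio.ogg is the file to send to the API. Passing the arguments as a list avoids quoting problems with paths that contain spaces.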

Thanks very much for your reply. I assumed audio.mp3 was the input filename but wasn’t positive. I am recording system audio on Windows 10 and the only Windows option is an m4a file. Do you think it would degrade the audio to convert from m4a to mp3 and then to ogg?

Not really. The audio itself will change, but from a transcription/translation point of view it won’t matter.
But again, if you have an audio-sensitive application, please keep the original file stored.

Yes, you can send and reprocess m4a, which is just mp4 that Apple renamed.

Many have discovered that m4a files generated directly by an Apple device backend can fail outright with the Whisper API, due to problems with their encoding and codecs.

mp4 is a container that can contain multiple streams, and can be demuxed instead of re-encoded to get just the first audio stream into a new mp4. It will typically have AAC.
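
A sketch of the demux step described above, again driven from Python. It copies the first audio stream without re-encoding, so quality is untouched but the size barely shrinks. It assumes ffmpeg is on the PATH and that the stream is something an .m4a container accepts (typically AAC); the function name is made up.

```python
import subprocess


def build_demux_command(input_path, output_path, ffmpeg="ffmpeg"):
    """Build an ffmpeg command that copies the first audio stream as-is."""
    return [
        ffmpeg,
        "-i", input_path,
        "-map", "0:a:0",   # first audio stream of the first input
        "-c:a", "copy",    # stream copy, no re-encoding
        output_path,
    ]


def demux_audio(input_path, output_path):
    """Run the demux and raise CalledProcessError if ffmpeg fails."""
    subprocess.run(build_demux_command(input_path, output_path), check=True)
```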

Making audio streams more efficient means re-encoding though.

FFmpeg adapts to the input file type; it only fails if you specify very specific parameters that don’t match the input. The options in the command I wrote above discard extra streams and metadata, sum the audio to mono, and then encode with the efficient Opus codec using its voice (voip) settings.

Thanks for your reply. I just changed the audio file extension from m4a to mp4 and ran your command. It worked like a charm. I got about a 14x reduction in file size.

I just upgraded to python 3.9.1. Now I keep getting the error
“Unable to choose an output format for ‘ffmpeg’; use a standard extension for the filename or specify the format manually.
[out#0 @ 000001adb02c8f00] Error initializing the muxer for ffmpeg: Invalid argument
Error opening output file ffmpeg.
Error opening output files: Invalid argument”

It’s making me crazy(ier)

Depending on the standalone version of ffmpeg that is being used and the OS, you may, like the error states, need to manually specify the output format container as -f ogg if you are writing your own output file extension.

FFMPEG also needs to be new enough and be compiled with OGG and Opus support.

I downloaded new ffmpeg here.

ffmpeg-master-latest-win64-gpl-shared.zip.
I am in Windows10 using VSCode.

“D:\VS Code Projects\ffmpegdir\bin\ffmpeg.exe” ffmpeg -i input.mp3 -vn -map_metadata -1 -ac 1 -c:a libopus -b:a 12k -application voip -f ogg audio.ogg

What do you think?
BTW, I really appreciate your help on this. It’s frustrating because I am so clueless about this.

“master” means testing version with bleeding-edge changes.
“gpl” means no contentious proprietary or patented software codecs.

Here’s a site where you’ll want to get a “full-build” “release” version of the ffmpeg exe (a 2022-01-10 build from there is what’s running on my system).

https://www.gyan.dev/ffmpeg/builds/#release-builds

I could ZIP up that 2022 version if you want an exe from well before anyone would have thought to include an OpenAI API key stealer into a random exe. But then you’d have to trust me…

I downloaded FFmpeg from the link you suggested. Thanks for that. When I run the following command, it reads all the metadata just fine. But I still get the same error. As I say, it worked just fine in Python 3.8.3. I’ll try that, but changing the interpreter in VSCode doesn’t do a thing. I have to change the system path and reboot. Any other advice before I do that?
“D:\VS Code Projects\ffmpeg-7.0-essentials_build\bin\ffmpeg.exe” ffmpeg -i myAudio.mp3 -vn -map_metadata -1 -ac 1 -c:a libopus -b:a 12k -application voip -f ogg audio.ogg

Stream #0:0: Audio: mp3 (mp3float), 44100 Hz, mono, fltp, 128 kb/s
[AVFormatContext @ 00000248315d1640] Unable to choose an output format for ‘ffmpeg’; use a standard extension for the filename or specify the format manually.
[out#0 @ 00000248315d1140] Error initializing the muxer for ffmpeg: Invalid argument
Error opening output file ffmpeg.
Error opening output files: Invalid argument

What time zone are you in? Will you be around later this evening or tomorrow?

If you have a system Python installed, you can just run your .py directly, or by opening the file in IDLE 3.9, and picking “run” to execute in its print-shell. You can try that to find out if VSCode doesn’t want to trust your interpreter or external binaries or is not piping what you think it is.

Run the ffmpeg command line in a cmd.exe shell simply to ensure that it alone will encode a file.
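
One thing worth checking, as a possible reading of the error rather than a confirmed diagnosis: in the commands quoted above, the path to ffmpeg.exe is followed by the word ffmpeg again. ffmpeg treats a bare argument that is not an option or an option value as an output file name, which would match “Unable to choose an output format for ‘ffmpeg’”. A sketch of the same call from Python with that extra token removed; the exe path is copied from the post above, and the helper name is made up.

```python
import subprocess

# Path taken from the post above; adjust it to your own install.
FFMPEG_EXE = r"D:\VS Code Projects\ffmpeg-7.0-essentials_build\bin\ffmpeg.exe"


def strip_repeated_program_name(argv):
    """Drop a stray 'ffmpeg' that directly follows the executable path."""
    if len(argv) > 1 and argv[1].lower() in ("ffmpeg", "ffmpeg.exe"):
        return [argv[0]] + argv[2:]
    return list(argv)


as_posted = [
    FFMPEG_EXE, "ffmpeg", "-i", "myAudio.mp3", "-vn", "-map_metadata", "-1",
    "-ac", "1", "-c:a", "libopus", "-b:a", "12k", "-application", "voip",
    "-f", "ogg", "audio.ogg",
]
fixed = strip_repeated_program_name(as_posted)
print(fixed[:3])

# Uncomment to actually run the encode:
# subprocess.run(fixed, check=True)
```

In a plain shell the equivalent is to type either the full path to the exe or the word ffmpeg (if it is on the PATH), but not both.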