For reliable results with ffmpeg, keep in mind that the input file can contain other streams, such as an m4a with embedded MJPEG cover art (which ffmpeg treats as a video stream), plus other metadata. These can corrupt the output and waste space, so they must not be passed through to the output.
Another thing you can do is recompress with ffmpeg.
I take a 64k stereo mp3 and mash it down to 12 kbps mono with Opus in an Ogg container, also using the speech optimizations. The command line is below:
ffmpeg -i audio.mp3 -vn -map_metadata -1 -ac 1 -c:a libopus -b:a 12k -application voip audio.ogg
Opus holds up better than most codecs at low bitrates, and Whisper accepts it in an Ogg container.
(Conversion log)
Input #0, mp3, from 'audio.mp3':
Duration: 00:00:27.74, start: 0.000000, bitrate: 64 kb/s
Stream #0:0: Audio: mp3, 44100 Hz, stereo, fltp, 64 kb/s
File 'audio12.ogg' already exists. Overwrite? [y/N] y
Stream mapping:
Stream #0:0 -> #0:0 (mp3 (mp3float) -> opus (libopus))
Press [q] to stop, [?] for help
Output #0, ogg, to 'audio12.ogg':
Metadata:
encoder : Lavf59.17.100
Stream #0:0: Audio: opus, 48000 Hz, mono, flt, 12 kb/s
Metadata:
encoder : Lavc59.20.100 libopus
size= 43kB time=00:00:27.75 bitrate= 12.6kbits/s speed=48.9x
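As a sanity check on the log above, the output size should be roughly bitrate times duration divided by 8. The sketch below is only illustrative: `estimate_kb` is a hypothetical helper name, it counts 1 kB as 1000 bytes, and it ignores Ogg container overhead, which is why ffmpeg reports a slightly higher 12.6 kbit/s and 43kB.

```shell
#!/bin/sh
# estimate_kb BITRATE_KBPS SECONDS
# Hypothetical helper: prints the approximate audio payload size in kB
# (1 kB = 1000 bytes), ignoring container overhead.
estimate_kb() {
    awk -v kbps="$1" -v secs="$2" 'BEGIN { printf "%.1f\n", kbps * secs / 8 }'
}

# 12 kbps target bitrate, 27.75 s clip from the log above
estimate_kb 12 27.75   # prints 41.6
```

So about 42 kB of audio, which is in line with the 43kB ffmpeg reports once the container is included.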
Comparing the two transcriptions, the re-encoded version (top) is actually more accurate at the start of the audio:
{
"text": "that this is a radio show where people call us and ask us questions about cars, right? And what were we just talking about before the mics came on? We were both talking about what’s wrong with our respective vehicles. This has happened in the mind that charges their systems aren’t working. It’s pretty sad. Well, my real question is, who do we call? Who do we call? I call you when I have a problem."
}
{
"text": "This is a radio show where people call us and ask us questions about cars, right? And what were we just talking about before the mics came on? We were both talking about what’s wrong with our respective vehicles. This has happened in the mind that charges their systems aren’t working. It’s pretty sad. Well, the real question is, who do we call? Who do we call? I call you when I have a problem."
}
Encoding 3.5 hours of Howard Stern from AAC to Opus (which would be a $1.25 transcript) took the file from 86MB to 19MB. Stripping the embedded multimedia, as described above, was required to make the result play in foobar2000, and it also leaves more bits for the audio. (PS: don't actually submit a file this long; you'll likely get an API timeout.)
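If you do want to transcribe something that long, one possible workaround is to cut the encoded file into shorter pieces and submit them separately. This is a sketch that was not part of the test above: `split_audio` is a hypothetical helper, the 600-second default is an arbitrary choice, and it relies on ffmpeg's segment muxer with stream copy so nothing is re-encoded. Cuts may land mid-sentence, so transcripts can be slightly off at the boundaries.

```shell
#!/bin/sh
# split_audio INPUT [SECONDS]
# Hypothetical helper: splits INPUT into pieces of about SECONDS each
# (default 600, an arbitrary assumption) without re-encoding.
# Set DRY_RUN=1 to print the ffmpeg command instead of running it.
split_audio() {
    in=$1
    seconds=${2:-600}
    base=${in%.*}    # file name without the extension
    ext=${in##*.}    # extension, reused so the pieces keep the same container
    ${DRY_RUN:+echo} ffmpeg -i "$in" -c copy -f segment \
        -segment_time "$seconds" "${base}_%03d.${ext}"
}

# Example (commented out so the block can be sourced safely):
# split_audio audio.ogg 600    # writes audio_000.ogg, audio_001.ogg, ...
```

Each piece can then be uploaded on its own and the returned texts joined in order.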