Sending an hour's worth of audio through Whisper using Node.js

I’m working on a project utilizing Node.js, React, and Express, where we aim to send audio conversations to the Whisper API. However, we’ve encountered a challenge due to the API’s 25MB data limit. Our conversations often surpass this limit, and we’re in a bind on how best to chunk or split the audio effectively. The goal is to ensure we adhere to the size constraint without losing the context and continuity of the conversations.

Has anyone in the community faced this particular issue and successfully navigated it? We’re seeking guidance on how to manage and send longer conversations to the Whisper API using our tech stack.

Any insights, best practices, or proven solutions from those who’ve tackled this before would be immensely valuable.

Thank you in advance for your help!

Either chop it into 25 MB chunks and risk cutting mid-word, or use pydub or a similar library and break it on a silent boundary near the 25 MB mark. (ref)

4 Likes

Not sure if I can use pydub, as I'm not working in Python

2 Likes

It’s a quasi-assumption that all AI-based forums use Python. :rofl:

Gosh, I really don’t know what open-source option is out there. I would just risk it, personally. The real question you have to ask yourself: do you really care if one word out of 25 MB of data is off? I wouldn’t, since the model’s WER is higher than you think, and it’s already messing up more than you think.

So just chop the files up in the most appropriate way in the language you want, IMO.
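In Node, the naive "just chop it" approach could look something like the sketch below (not from the thread; the file name is a placeholder, and note that byte-boundary cuts can land mid-word or mid-frame, which is exactly the risk described above):

```javascript
// Stay just under Whisper's 25 MB request limit
const CHUNK_SIZE = 24 * 1024 * 1024

// Split a buffer into chunks of at most `size` bytes
function splitBuffer(buf, size) {
    const chunks = []
    for (let offset = 0; offset < buf.length; offset += size) {
        chunks.push(buf.subarray(offset, offset + size))
    }
    return chunks
}

// In practice: const audio = fs.readFileSync('conversation.mp3')
// Demo on a small in-memory buffer:
console.log(splitBuffer(Buffer.alloc(10), 4).map((c) => c.length)) // [ 4, 4, 2 ]
```

Each chunk can then be written out with `fs.writeFileSync` and uploaded separately.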

3 Likes

This is using ffmpeg-pac. I haven’t tried it, but I wrote it down in my notes; I probably found it somewhere here before:

Edit:
I am trying to find the module ffmpeg-pac but I cannot find it. I think that code was probably generated by ChatGPT, so I removed it. Anyway, the command-line way to split using ffmpeg is:

ffmpeg -i "input_audio_file.mp3" -f segment -segment_time 3600 -c copy output_audio_file_%03d.mp3
3 Likes

Thanks! We will try implementing this solution, and I’ll be sure to get back to you if it works out.

1 Like

I tried to find another solution that doesn’t use any external module/library. Here is something I tried and tested. I am assuming you have ffmpeg installed on the backend.

import path from 'path'
import { exec } from 'child_process'

const sourceAudio = path.join('public', 'audio', 'Amore.mp3')
const outputAudio = path.join('public', 'audio', 'Amore-segment_%03d.mp3')

const ret = await new Promise((resolve) => {
    // Split into 120-second segments, copying the stream without re-encoding
    const sCommand = `ffmpeg -i "${sourceAudio}" -f segment -segment_time 120 -c copy "${outputAudio}"`

    exec(sCommand, (error, stdout, stderr) => {
        if (error) {
            resolve({
                status: 'error',
                message: error.message,
            })
        } else {
            resolve({
                status: 'success',
                error: stderr, // ffmpeg writes its progress log to stderr
                out: stdout,
            })
        }
    })
})

Result

Original:
Amore.mp3, 3:07, 4.4MB

Output:
Amore-segment_000.mp3, 2:00, 2.8MB
Amore-segment_001.mp3, 1:07, 1.6MB
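Once the segments exist, each one can be posted to the transcriptions endpoint and the results joined in order. A rough sketch of that step (not part of the tested code above; the `whisper-1` model name and file paths are assumptions, and it needs Node 18+ for the global fetch/FormData/Blob, plus OPENAI_API_KEY in the environment):

```javascript
import fs from 'fs'

// Build the segment file names matching ffmpeg's %03d output pattern
function segmentNames(base, count) {
    return Array.from({ length: count }, (_, i) =>
        `${base}-segment_${String(i).padStart(3, '0')}.mp3`)
}

// Send one segment to the Whisper transcriptions endpoint
async function transcribe(file) {
    const form = new FormData()
    form.append('model', 'whisper-1')
    form.append('file', new Blob([fs.readFileSync(file)]), file)
    const res = await fetch('https://api.openai.com/v1/audio/transcriptions', {
        method: 'POST',
        headers: { Authorization: `Bearer ${process.env.OPENAI_API_KEY}` },
        body: form,
    })
    return (await res.json()).text
}

// Transcribe the two segments in order and join the text:
// const parts = await Promise.all(segmentNames('public/audio/Amore', 2).map(transcribe))
// const transcript = parts.join(' ')
```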
3 Likes

For reliable use of ffmpeg, note that input files can contain other streams besides the audio, e.g. an m4a with mjpeg cover art (a video stream), plus metadata that corrupts playback and wastes space; these must not be passed through to the output.

Another thing you can do is recompress with ffmpeg.

I take a 64 kbps stereo MP3 and mash it with Opus in an OGG container down to 12 kbps mono, also using the speech optimizations. The command line is below:

ffmpeg -i audio.mp3 -vn -map_metadata -1 -ac 1 -c:a libopus -b:a 12k -application voip audio.ogg

Opus is the highest-quality codec at low bitrates, and it is supported by Whisper in an OGG container.
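That same recompression can be wrapped in Node with child_process; a sketch (file paths are placeholders, and it assumes an ffmpeg build with libopus is on the PATH):

```javascript
import { execFile } from 'child_process'
import { promisify } from 'util'

const execFileAsync = promisify(execFile)

// Build the argument list for the Opus recompression shown above
function opusArgs(input, output) {
    return [
        '-i', input,
        '-vn',                  // drop video/cover-art streams
        '-map_metadata', '-1',  // strip metadata
        '-ac', '1',             // downmix to mono
        '-c:a', 'libopus',      // Opus codec
        '-b:a', '12k',          // 12 kbps
        '-application', 'voip', // speech-optimized mode
        output,
    ]
}

// Recompress one file
async function toOpus(input, output) {
    await execFileAsync('ffmpeg', opusArgs(input, output))
}

// Example (hypothetical paths): await toOpus('audio.mp3', 'audio.ogg')
```

Using execFile with an argument array also avoids the shell-quoting issues that come with interpolating file names into a command string.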

(Conversion log)
Input #0, mp3, from 'audio.mp3':
  Duration: 00:00:27.74, start: 0.000000, bitrate: 64 kb/s
  Stream #0:0: Audio: mp3, 44100 Hz, stereo, fltp, 64 kb/s
File 'audio12.ogg' already exists. Overwrite? [y/N] y
Stream mapping:
  Stream #0:0 -> #0:0 (mp3 (mp3float) -> opus (libopus))
Press [q] to stop, [?] for help
Output #0, ogg, to 'audio12.ogg':
  Metadata:
    encoder         : Lavf59.17.100
  Stream #0:0: Audio: opus, 48000 Hz, mono, flt, 12 kb/s
    Metadata:
      encoder         : Lavc59.20.100 libopus
size=      43kB time=00:00:27.75 bitrate=  12.6kbits/s speed=48.9x

Comparing the two transcriptions, the re-encoded version (top) is actually more accurate at the start of the audio:

{
"text": "that this is a radio show where people call us and ask us questions about cars, right? And what were we just talking about before the mics came on? We were both talking about what's wrong with our respective vehicles. This has happened in the mind that charges their systems aren't working. It's pretty sad. Well, my real question is, who do we call? Who do we call? I call you when I have a problem."
}
{
"text": "This is a radio show where people call us and ask us questions about cars, right? And what were we just talking about before the mics came on? We were both talking about what's wrong with our respective vehicles. This has happened in the mind that charges their systems aren't working. It's pretty sad. Well, the real question is, who do we call? Who do we call? I call you when I have a problem."
}

Encoding 3.5 hours of Howard Stern from AAC to Opus (which would be a $1.25 transcript) took 86 MB down to 19 MB. Stripping the extra streams as above was required to make it play in foobar2000, and it leaves more bits for the audio. (PS: don’t actually send a file that long; you’ll likely get an API timeout.)

2 Likes

Great! Yes, we were struggling to find the -pac library, but I do know of ffmpeg. Okay, we will try the code you presented here!

Thanks.
But I have a question.
Can I use file size instead of time in this command?
The OpenAI Whisper API limits file size to 25MB, so I need to split large audio files into chunks. If I could use a file-size value instead of a time value, that would be great.
Please help me.

Due to the nature of the data (audio), it is more logical to split your file by time, and there is no direct way to split by size using ffmpeg. But in my experience, the chunks output using the time approach have similar file sizes. So I would suggest approximating how long a chunk of your data can be and still fit within 25MB, and just use that.
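The arithmetic behind that approximation is simple: a file's size is roughly duration × bitrate ÷ 8. A small helper (the 64 kbps figure is just an example, not from the thread):

```javascript
// Longest segment (in whole seconds) that fits in `maxBytes`
// at a constant audio bitrate given in bits per second
function maxSegmentSeconds(maxBytes, bitrateBps) {
    return Math.floor((maxBytes * 8) / bitrateBps)
}

// A 64 kbps MP3 under the 25 MB limit:
console.log(maxSegmentSeconds(25 * 1024 * 1024, 64000)) // 3276 (about 54 minutes)
```

So at 64 kbps, something like `-segment_time 3000` would leave comfortable headroom under the limit; variable-bitrate files need a larger safety margin.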

Thanks very much for your help.
I will ping you later if I have any other questions.
Thanks again.