Send an hour's worth of audio through Whisper using node.js

Either chop it into 25 MB chunks and risk cutting it mid-word, or use pydub or a similar library and break it on a silence near the 25 MB boundary. (ref)
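A minimal sketch of the first option in Node (the `chunkFile` helper and its naming are my own, not from any library; mp3 decoders usually resync on the next frame header, but the word spanning each boundary may be lost):

```javascript
import fs from 'fs'

// Naive byte-level chunker: splits a file into fixed-size pieces.
// This can cut mid-word (and mid-frame); a decoder will usually resync
// on the next mp3 frame header, but audio at each boundary may be lost.
function chunkFile (inputPath, outputPrefix, chunkSize = 25 * 1024 * 1024) {
  const data = fs.readFileSync(inputPath)
  const paths = []
  for (let i = 0, n = 0; i < data.length; i += chunkSize, n++) {
    const outPath = `${outputPrefix}_${String(n).padStart(3, '0')}.mp3`
    fs.writeFileSync(outPath, data.subarray(i, i + chunkSize))
    paths.push(outPath)
  }
  return paths
}
```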


It’s a quasi-assumption that all AI-based forums use Python. :rofl:

Gosh, I really don’t know what open-source thing is out there. I would just risk it, personally. The real question you have to ask yourself: do you really care if one word out of 25 MB of data is off? I wouldn’t, since the model’s WER is higher than you think, and it messes up more than you think.

So just chop the files up in the most appropriate way in the language you want, IMO.


This uses ffmpeg-pac. I have not tried it, but I wrote it down in my notes; I probably found it somewhere here before:

Edit:
I tried to find the module ffmpeg-pac but I cannot find it, so the code was probably generated by ChatGPT, and I removed it. Anyway, the command-line way to split using ffmpeg is:

ffmpeg -i "input_audio_file.mp3" -f segment -segment_time 3600 -c copy output_audio_file_%03d.mp3

I tried to find another solution that does not use any external module/library. Here is something I tried and tested, assuming you have ffmpeg installed on the backend.

import path from 'path'
import { exec } from 'child_process'

const sourceAudio = path.join('public', 'audio', 'Amore.mp3')
const outputAudio = path.join('public', 'audio', 'Amore-segment_%03d.mp3')

const ret = await new Promise((resolve) => {
    // split into 120-second segments, copying the codec (no re-encode)
    const sCommand = `ffmpeg -i "${sourceAudio}" -f segment -segment_time 120 -c copy "${outputAudio}"`

    exec(sCommand, (error, stdout, stderr) => {
        if (error) {
            resolve({
                status: 'error',
                error: error.message,
            })
        } else {
            resolve({
                status: 'success',
                error: stderr,
                out: stdout,
            })
        }
    })
})

Result

Original:
Amore.mp3, 3:07, 4.4MB

Output:
Amore-segment_000.mp3, 2:00, 2.8MB
Amore-segment_001.mp3, 1:07, 1.6MB
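To complete the picture of sending the segments through Whisper, here is a hedged sketch for Node 18+ (which provides `fetch`, `FormData`, and `Blob` as globals). The `buildForm` and `transcribe` helper names are my own; the endpoint and `whisper-1` model name are the standard OpenAI transcription API:

```javascript
import fs from 'fs'
import path from 'path'

// Build the multipart form for one audio segment.
function buildForm (filePath) {
  const form = new FormData()
  form.append('model', 'whisper-1')
  form.append('file', new Blob([fs.readFileSync(filePath)]), path.basename(filePath))
  return form
}

// Send a segment to the Whisper transcription endpoint.
// Assumes OPENAI_API_KEY is set in the environment.
async function transcribe (filePath) {
  const res = await fetch('https://api.openai.com/v1/audio/transcriptions', {
    method: 'POST',
    headers: { Authorization: `Bearer ${process.env.OPENAI_API_KEY}` },
    body: buildForm(filePath)
  })
  return res.json() // e.g. { text: '...' }
}
```

You would call `transcribe` once per segment and concatenate the resulting texts.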

For reliable use of ffmpeg, note that the input file can contain other streams (for example, an m4a with an mjpeg cover-art video stream) and other metadata that wastes space and can corrupt the output; these must not be passed to the output.

Another thing you can do is recompress with ffmpeg.

I take a 64 kbps stereo mp3 and mash it with Opus in an OGG container down to 12 kbps mono, also using the speech optimizations. The command line is below:

ffmpeg -i audio.mp3 -vn -map_metadata -1 -ac 1 -c:a libopus -b:a 12k -application voip audio.ogg

Opus has the highest quality at low bitrates, and is supported by Whisper in an OGG container.
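The same recompression can be scripted from Node; a sketch, with helper names of my own, that just shells out to the ffmpeg command above (requires ffmpeg built with libopus on the PATH):

```javascript
import { exec } from 'child_process'
import { promisify } from 'util'

const execAsync = promisify(exec)

// Build the recompression command: drop video streams (-vn), strip
// metadata, downmix to mono, encode 12 kbps Opus with the speech
// (voip) tuning. -y overwrites an existing output without prompting.
function buildOpusCommand (input, output) {
  return `ffmpeg -y -i "${input}" -vn -map_metadata -1 -ac 1 -c:a libopus -b:a 12k -application voip "${output}"`
}

// Run the recompression.
async function toOpus (input, output) {
  return execAsync(buildOpusCommand(input, output))
}
```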

(Conversion log)
Input #0, mp3, from 'audio.mp3':
  Duration: 00:00:27.74, start: 0.000000, bitrate: 64 kb/s
  Stream #0:0: Audio: mp3, 44100 Hz, stereo, fltp, 64 kb/s
File 'audio12.ogg' already exists. Overwrite? [y/N] y
Stream mapping:
  Stream #0:0 -> #0:0 (mp3 (mp3float) -> opus (libopus))
Press [q] to stop, [?] for help
Output #0, ogg, to 'audio12.ogg':
  Metadata:
    encoder         : Lavf59.17.100
  Stream #0:0: Audio: opus, 48000 Hz, mono, flt, 12 kb/s
    Metadata:
      encoder         : Lavc59.20.100 libopus
size=      43kB time=00:00:27.75 bitrate=  12.6kbits/s speed=48.9x

Comparing the two transcriptions, the re-encoded version (top) is actually more accurate at the start of the audio:

{
"text": "that this is a radio show where people call us and ask us questions about cars, right? And what were we just talking about before the mics came on? We were both talking about what's wrong with our respective vehicles. This has happened in the mind that charges their systems aren't working. It's pretty sad. Well, my real question is, who do we call? Who do we call? I call you when I have a problem."
}
{
"text": "This is a radio show where people call us and ask us questions about cars, right? And what were we just talking about before the mics came on? We were both talking about what's wrong with our respective vehicles. This has happened in the mind that charges their systems aren't working. It's pretty sad. Well, the real question is, who do we call? Who do we call? I call you when I have a problem."
}

Encoding 3.5 hours of Howard Stern AAC to Opus (which would be a $1.25 transcript) takes it from 86 MB to 19 MB. Stripping the extra streams as above was required to make it play in foobar2000, and leaves more bits for the audio. (PS: don't actually send a file this long; you'll likely get an API timeout.)


Thanks.
But I have a question.
Can I use file size instead of time in this command?
OpenAI Whisper limits the file size to 25 MB, so I need to split the large audio file into chunks. If I could use a file-size value instead of a time value, that would be great.
Please help me.

Due to the nature of the data (audio), it is more logical to split your file by time, and there is no direct way to split by size using ffmpeg. But in my experience, the chunks output by the time approach have similar file sizes. So I would suggest that you approximate how much time fits within 25 MB for your data and just use that.
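That approximation can be sketched as a back-of-the-envelope calculation in Node. The `segmentSeconds` helper and the 0.95 safety margin are my own choices, and it assumes a roughly constant bitrate:

```javascript
// Estimate a -segment_time value that keeps each chunk under the
// 25 MB Whisper limit, assuming a roughly constant bitrate.
function segmentSeconds (totalBytes, totalSeconds, limitBytes = 25 * 1024 * 1024) {
  const bytesPerSecond = totalBytes / totalSeconds
  // 0.95 is an arbitrary safety margin for bitrate variation
  return Math.floor((limitBytes / bytesPerSecond) * 0.95)
}

// A one-hour 64 kb/s mp3 is about 8000 bytes/s, giving roughly
// 3112 seconds (~52 minutes) per chunk.
```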

Thanks very much for your help.
I will ping you later if I have any other questions.
Thanks again.