GPT-4o Transcribe: MAX 1500 SECONDS?

I was really excited to use GPT-4o Transcribe instead of Whisper, but I just ran into a major roadblock and would appreciate some insight from other people.


I'm getting “400 audio duration 3189.204 seconds is longer than 1500 seconds which is the maximum for this model”, which is really unfortunate because I already spent some time getting the file size down with an FFmpeg command.

The two options as I see them are:

Option 1: Split Long Files

  • Break files into <25-minute chunks before uploading (see the sketch below)

Option 2: Use Whisper-1 Instead

  • Whisper-1 has no duration limit (just a 25 MB file size limit)
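
For reference, here's how I'd do the splitting if I go the chunking route. A rough sketch, assuming ffmpeg is on the PATH and a hypothetical input.mp3:

```python
import subprocess

# Split the source into ~23-minute (1380 s) pieces, safely under the
# 1500-second cap; "-c copy" copies the stream without re-encoding.
subprocess.run(
    [
        "ffmpeg", "-i", "input.mp3",
        "-f", "segment", "-segment_time", "1380",
        "-c", "copy",
        "chunk_%03d.mp3",
    ],
    check=True,
)
```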

If there is no other option I will just choose option 2 because it's less of a headache, unless GPT-4o Transcribe is way, way better. I see people online speeding up their audio files for cost reasons. I don't care about the cost, but if speeding up the audio lets it fit within GPT-4o Transcribe's limit, is it worth it? Will speeding it up defeat the purpose of using GPT-4o Transcribe instead of Whisper for accuracy?

If speeding it up is actually a viable option, should I undo the FFmpeg pass that set all the files to:

  • sampling rate: 16,000 Hz
  • audio bitrate: 32 kbps

Keep in mind my longest audio file is 1 hour 16 minutes, so even 2x speed wouldn't get it under 25 minutes.
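
For context, the speed-up I'm considering would be a single FFmpeg pass on top of those settings. A rough sketch with hypothetical filenames (atempo is limited to 2.0 per filter instance, so you'd chain two instances for anything faster):

```python
import subprocess

# One pass: double the speed and keep the 16 kHz / 32 kbps settings.
# Filenames are hypothetical; chain filters ("atempo=1.5,atempo=2.0")
# if more than 2x is needed.
subprocess.run(
    [
        "ffmpeg", "-i", "input.mp3",
        "-filter:a", "atempo=2.0",
        "-ar", "16000", "-b:a", "32k",
        "fast_input.mp3",
    ],
    check=True,
)
```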

Let me know if anyone sees a workaround for this please!

Sorry for any bad spelling or grammar,

Nick

Hi Nick,

If cost is not an issue, then the recommendation is to use the GPT-4o Transcribe model, as it “offers improvements to word error rate and better language recognition and accuracy compared to original Whisper models.” (Source: Model Card).

Technically you could speed up the audio, but as you highlighted, your longest file still won't fit even at 2x speed. If it's not too much work, it would be best to chunk the file yourself using any audio/video editing software and have the GPT-4o Transcribe model transcribe each chunk. If speed is the need of the hour, you could also make parallel API calls and stitch together the text you get back from each call. In theory this takes about the same wall-clock time as a single API call on a sped-up file, so cost is the only piece you'd have to think about.
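
Here is a minimal sketch of the parallel approach using the official OpenAI Python SDK, assuming the chunks already exist on disk as chunk_*.mp3 (hypothetical names from a prior splitting step) and that OPENAI_API_KEY is set in the environment:

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def transcribe(path: Path) -> str:
    # One API call per chunk.
    with path.open("rb") as f:
        return client.audio.transcriptions.create(
            model="gpt-4o-transcribe",
            file=f,
        ).text

# Hypothetical chunk files produced by an earlier splitting step.
chunks = sorted(Path(".").glob("chunk_*.mp3"))

# pool.map preserves input order, so the stitched transcript stays
# in chronological order even though the calls run in parallel.
with ThreadPoolExecutor(max_workers=8) as pool:
    texts = list(pool.map(transcribe, chunks))

transcript = "\n".join(texts)
print(transcript)
```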

To start, you could try different models on, say, three one-minute snippets sampled from different parts of your file and compare the results. There is also a GPT-4o mini Transcribe model you can explore, and it might just fit the bill.
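
A quick way to cut those sample snippets, again with FFmpeg (hypothetical filenames; the start offsets are arbitrary examples):

```python
import subprocess

# Extract three one-minute snippets from different parts of the file;
# "-ss" is the start offset in seconds, "-t 60" the snippet length.
for i, start in enumerate([300, 1500, 2700]):
    subprocess.run(
        [
            "ffmpeg", "-ss", str(start), "-t", "60",
            "-i", "input.mp3", "-c", "copy",
            f"snippet_{i}.mp3",
        ],
        check=True,
    )
```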


Yes, I actually just designed a pre-processor that keeps the file size under 25 MB and also speeds the audio up by 3x, so the maximum source length is 75 minutes. If a file is longer than that, the pre-processor will still process it but won't send it to the API; it will tell me I need to chunk it manually. If it's under 75 minutes, it sends the result to GPT-4o Transcribe.
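
Roughly, the gate looks like this (a simplified sketch, not the full pre-processor; it assumes ffmpeg/ffprobe are installed and uses hypothetical filenames):

```python
import subprocess
from pathlib import Path

MAX_SECONDS = 1500  # gpt-4o-transcribe duration cap
SPEED = 3.0         # 3x speed-up, so up to 75 minutes of source audio

def duration_seconds(path: Path) -> float:
    # ffprobe ships with ffmpeg and prints the duration as a float.
    out = subprocess.run(
        ["ffprobe", "-v", "error",
         "-show_entries", "format=duration",
         "-of", "default=noprint_wrappers=1:nokey=1", str(path)],
        capture_output=True, text=True, check=True,
    )
    return float(out.stdout.strip())

def preprocess(path: Path):
    if duration_seconds(path) / SPEED > MAX_SECONDS:
        print(f"{path}: still over 1500 s at {SPEED}x, chunk manually")
        return None
    sped = path.with_name(f"fast_{path.name}")
    # atempo is capped at 2.0 per instance, so chain two for 3x.
    subprocess.run(
        ["ffmpeg", "-i", str(path),
         "-filter:a", "atempo=1.5,atempo=2.0",
         "-ar", "16000", "-b:a", "32k", str(sped)],
        check=True,
    )
    return sped  # ready to send to the transcription API
```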

Thank you for the response!

There's also Whisper, which simply works: it isn't a multimodal AI repurposed as its own endpoint, so it doesn't have a context-window limit. Its documentation is actually correct, without "gotcha"s.

With Whisper, I've had over three hours of transcript returned from a single call, and you can keep the upload small by transmitting optimized Opus in an Ogg container.
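
For example, a rough encoding sketch assuming ffmpeg was built with libopus (filenames are hypothetical); at roughly 12 kbps, even three hours of speech stays well under Whisper's 25 MB cap:

```python
import subprocess

# Voice-optimized Opus in an Ogg container: ~12 kbps keeps a
# three-hour recording around 16 MB, under the 25 MB upload limit.
subprocess.run(
    [
        "ffmpeg", "-i", "input.mp3",
        "-c:a", "libopus", "-b:a", "12k",
        "-application", "voip", "-ar", "16000",
        "output.ogg",
    ],
    check=True,
)
```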
