Hi everyone,
I’ve run into a frustrating issue with the gpt-4o-transcribe (and gpt-4o-mini-transcribe) models. No matter how I prepare my audio, the transcription output always gets truncated after about 8–9 minutes of audio.
Here’s what I’ve tried so far:

- Converted the source video (.mkv) into clean audio chunks using ffmpeg.
- Made sure each chunk is mono, 16kHz, normalized with loudnorm, and low/high-pass filtered for clarity.
- Exported to .m4a (AAC) instead of MP3 to avoid VBR issues.
- Limited file sizes to well under 25 MB.
- Limited durations, first to 1400s (~23 min), then to a much shorter 540s (~9 min), and even down to 480s (8 min).
- Sent requests with response_format=json instead of text.
- Tried both gpt-4o-transcribe and gpt-4o-mini-transcribe.
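For reference, the preprocessing steps above boil down to an ffmpeg invocation like the one sketched below. This is just how I build the command; the exact filter cutoffs (80 Hz / 8 kHz) and the 64k AAC bitrate are my own choices, not values anyone recommended:

```python
import shlex

def build_ffmpeg_cmd(src, dst, start_s, duration_s):
    """Build an ffmpeg command that extracts one mono, 16 kHz,
    loudness-normalized AAC chunk from a source video."""
    return [
        "ffmpeg", "-y",
        "-ss", str(start_s),    # seek to chunk start
        "-t", str(duration_s),  # cap chunk length (e.g. 480 s)
        "-i", src,
        "-vn",                  # drop the video stream
        "-ac", "1",             # mono
        "-ar", "16000",         # 16 kHz sample rate
        # high-pass / low-pass for clarity, then loudness normalization
        "-af", "highpass=f=80,lowpass=f=8000,loudnorm",
        "-c:a", "aac", "-b:a", "64k",  # AAC in .m4a, avoids MP3 VBR issues
        dst,
    ]

cmd = build_ffmpeg_cmd("talk.mkv", "talk_000.m4a", 0, 480)
print(shlex.join(cmd))
```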
Despite all of that, the API still only returns text for the first ~8–9 minutes of audio. The rest is cut off completely, with no error message of any kind, just a silently truncated transcription.
What’s interesting:

- The same chunks transcribed with whisper-1 return the full transcript as expected.
- So the problem seems to be specific to the 4o models.
My questions:

- Are there recommended best practices to avoid truncation (e.g. a maximum safe segment length)?
- Has anyone found a reliable workaround besides falling back to Whisper?
- Or is this something the dev team is aware of and working on fixing?
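In case it helps anyone hitting the same wall: my interim workaround is simply keeping every segment under the point where truncation kicks in. A minimal sketch of how I compute chunk boundaries (the 480 s cap comes from my tests above; the 2 s overlap is my own guess to avoid losing words at the seams):

```python
def chunk_bounds(total_s, max_chunk_s=480.0, overlap_s=2.0):
    """Return (start, duration) pairs covering total_s seconds of audio,
    each at most max_chunk_s long, with a small overlap between
    consecutive chunks so no words are lost at the boundaries."""
    bounds = []
    start = 0.0
    while start < total_s:
        dur = min(max_chunk_s, total_s - start)
        bounds.append((start, dur))
        if start + dur >= total_s:
            break
        start += dur - overlap_s  # back up slightly for the overlap
    return bounds

# e.g. a 23-minute (1380 s) file splits into three chunks of <= 480 s
print(chunk_bounds(1380))
```

Each pair can then be fed straight into the `-ss`/`-t` arguments of the ffmpeg step described earlier.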
Any insights would be really helpful. Thanks!