I’m using the transcription endpoint at /v1/audio/transcriptions. I’ve noticed that when I use the gpt-4o-transcribe or gpt-4o-mini-transcribe models with this endpoint, the transcript I receive back is truncated, usually somewhere between the 10- and 11-minute mark of my 12-minute recording. However, when I use the whisper-1 model, I get the full transcript back.
To clarify - the JSON response itself is well-formed and complete; it’s just that the text property of the response doesn’t have the entire content of the audio file.
I’ve tried audio files of different formats and sizes - mp3, mp4/aac, lower bitrates to reduce file size, mono vs. stereo, etc. But I can’t seem to get it to process the entire file. The file sizes I’ve tried have ranged from 8.5 MB to 17.0 MB.
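For reference, here’s roughly how I’m calling the endpoint (a minimal Python SDK sketch; the filename is just a placeholder):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# "meeting.mp3" stands in for my ~12-minute recording
with open("meeting.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",  # with "whisper-1" the same call returns the full text
        file=audio_file,
        response_format="json",
    )

# The JSON parses fine; transcript.text just stops around the 10-11 minute mark
print(transcript.text)
```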
Naturally, I could just use the whisper-1 model, but the other models have been more accurate for the niche topics I’m discussing in the recording, so I’d prefer to use them.
Has anyone else run into this issue, or have any ideas how to work around it?
Another question: has anyone tried .ogg with the new gpt-4o-transcribe? Whisper had no problem with it, and it’s how I’m getting around the 25 MB limit (by compressing the audio as much as possible), but I get an error from the new model when I try to upload .ogg.
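In case it’s useful, this is roughly how I do the compression (a sketch assuming ffmpeg with libopus is available; filenames and bitrate are just examples):

```python
import subprocess

# Re-encode to mono Opus in an .ogg container at a low bitrate so the
# file lands well under the 25 MB upload limit; speech stays intelligible.
subprocess.run(
    [
        "ffmpeg", "-i", "input.mp3",
        "-c:a", "libopus",  # Opus codec in an .ogg container
        "-b:a", "24k",      # aggressive speech bitrate
        "-ac", "1",         # downmix to mono
        "output.ogg",
    ],
    check=True,
)
```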
Same problem here. I’m uploading an mp3 that is well within the 25 MB limit, yet the transcription cuts off just under the 10-minute mark. I tried it with two audio files, one about 10m30s and the other about 12m, and both cut off around the same point.
I’ve had the same issue. I’m using the MediaStreamTimeProcessor recorder, which auto-slices the audio every 45 seconds. With whisper-1 it works perfectly, and I don’t experience any truncated transcripts. However, with the new gpt-4o-transcribe model, my audio-chunk transcripts often lack a good part of the recording, usually at either the beginning or the end of a chunk, mostly the end. It’s also mostly the last chunk that becomes truncated, i.e. the chunk that was sliced when I manually stopped the recording. I’m trying to figure it out but can’t find any solution.
Is this just a single model run with one context window that can be exhausted? Or are there internal techniques that split the received audio and keep the context small, similar to the 30-second windowed processing that whisper-1 applies to long audio?
Does the intelligent AI get fed up with doing the work and emit a stop sequence instead?
It would seem that consistently failing at about the 80% mark is the kind of fault you’d need observability over the whole job to diagnose.
I have noticed the same pattern. For example, in a 10-minute conversation (ogg format, 3.4 MB file), the content from the 10th minute is missing from the transcript. I can reproduce this with gpt-4o-transcribe and gpt-4o-mini-transcribe. However, whisper-1 does produce a complete transcript.
I have the exact same issue, but with pyaudio recordings split every 60 seconds. It’s not the duration, it’s the context window for me: shifting the chunk boundary even by a couple of seconds makes gpt-4o-transcribe understand; otherwise it interprets only one sentence. Whisper works. Temperature won’t save anyone this time. Competition is coming, so they need to move beyond the transcribe model being just a YouTube learning pipe.
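For anyone chunking the same way, this is the kind of shift/overlap I mean, sketched out (the sample rate, sample width, and filenames are assumptions about my setup):

```python
import wave

RATE = 16_000        # recording sample rate (assumption)
SAMPLE_WIDTH = 2     # 16-bit mono PCM (assumption)
CHUNK_SECONDS = 60
OVERLAP_SECONDS = 2  # shift each boundary back by a couple of seconds

def write_chunks(pcm: bytes, prefix: str = "chunk") -> None:
    """Slice raw PCM into 60 s chunks that overlap by ~2 s."""
    bytes_per_second = RATE * SAMPLE_WIDTH
    step = (CHUNK_SECONDS - OVERLAP_SECONDS) * bytes_per_second
    size = CHUNK_SECONDS * bytes_per_second
    for i, start in enumerate(range(0, len(pcm), step)):
        with wave.open(f"{prefix}_{i:03d}.wav", "wb") as wf:
            wf.setnchannels(1)
            wf.setsampwidth(SAMPLE_WIDTH)
            wf.setframerate(RATE)
            wf.writeframes(pcm[start:start + size])
```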
I’m experiencing the same issue. I use the model for voice input to my computer, and after switching from Whisper to gpt-4o-transcribe, I immediately noticed this problem (with German language input).
The last sentence is frequently missing from my transcriptions, and sometimes two sentences are combined into one. I’ve found that it seems to help if you don’t end the recording immediately after speaking, but instead wait two to three seconds before stopping.
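If you can’t change when the recording stops, padding the file before upload seems to have the same effect (a sketch using pydub, which needs ffmpeg installed; the helper name is mine):

```python
from pydub import AudioSegment

def pad_with_silence(in_path: str, out_path: str, ms: int = 2500) -> None:
    """Append ~2.5 s of silence so the final sentence isn't cut off."""
    audio = AudioSegment.from_file(in_path)
    tail = AudioSegment.silent(duration=ms, frame_rate=audio.frame_rate)
    (audio + tail).export(out_path, format="mp3")
```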
On the positive side, the new model definitely transcribes many technical terms correctly that were problematic before. For example, ChatGPT is no longer transcribed as “JetGPT,” and when speaking about OpenAI models, they’re correctly identified now.
Despite these improvements, the truncation issue is very frustrating. I hope OpenAI addresses this soon or provides API parameters that allow us to control this behavior.
I switched back to whisper. gpt-4o-transcribe leaves out whole sentences all the time, sometimes sentences in the middle too. It’s completely unreliable and just unusable for me.