Description:
When using the gpt-4o-transcribe model for speech-to-text conversion, the model fails to handle audio that contains pauses. During pauses (e.g., silence or gaps in speech), the output becomes inconsistent: segments are sometimes dropped, or only part of the input is transcribed.
Steps to Reproduce:
- Send an audio file/stream containing a pause (e.g., "Best places to visit in london…Best places to visit in uk")
- Call the OpenAI API with gpt-4o-transcribe (via cURL, as per the docs)
- Observe inconsistent outputs such as:
  - Dropped segments: "Best places to visit in uk" (first part missing)
  - Partial segments: "Best places to visit in london" (second part missing)
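For reference, a minimal cURL invocation of the transcription endpoint along the lines of the OpenAI docs (the file name `pause_sample.wav` is a placeholder for any audio clip containing a pause):

```shell
# Placeholder file name; substitute any audio with a mid-utterance pause.
curl https://api.openai.com/v1/audio/transcriptions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -F model="gpt-4o-transcribe" \
  -F file="@pause_sample.wav"
```

Running the same request repeatedly on the same file is enough to see the output vary between the dropped-segment and partial-segment cases above.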
Expected Behavior:
The model should consistently transcribe the full audio input, including pauses, similar to Whisper’s output:
"Best places to visit in london Best places to visit in uk"
- The Whisper model handles the same input correctly every time, but gpt-4o-transcribe does not.