Hi everyone,
I’ve spent a lot of time trying to build a reliable speech-to-text pipeline using OpenAI’s transcription models, both through the `/v1/audio/transcriptions` endpoint and the new real-time `/v1/realtime` WebSocket API (using the `gpt-4o-transcribe` model). I’ve tested this through a custom browser-based web app with a direct WebSocket connection, and I’ve tried a range of variations, including different chunk sizes, VAD settings, and silence durations.
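For reference, the core of my real-time setup looks roughly like the sketch below (Node/TypeScript with the `ws` package; the event and field names reflect my reading of the Realtime transcription docs, so please treat them as assumptions rather than a verified reference):

```typescript
// Minimal sketch of the real-time transcription setup described above.
// Assumes the "ws" package and an OPENAI_API_KEY environment variable.
import WebSocket from "ws";

const ws = new WebSocket(
  "wss://api.openai.com/v1/realtime?intent=transcription",
  {
    headers: {
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
      "OpenAI-Beta": "realtime=v1",
    },
  }
);

ws.on("open", () => {
  // Configure the session: model, audio format, and the server-side VAD
  // parameters I have been varying (threshold, padding, silence duration).
  ws.send(
    JSON.stringify({
      type: "transcription_session.update",
      session: {
        input_audio_format: "pcm16",
        input_audio_transcription: { model: "gpt-4o-transcribe" },
        turn_detection: {
          type: "server_vad",
          threshold: 0.5,
          prefix_padding_ms: 300,
          silence_duration_ms: 500,
        },
      },
    })
  );
});

// Stream microphone audio as base64-encoded PCM16 chunks.
function sendChunk(pcm16Chunk: Buffer): void {
  ws.send(
    JSON.stringify({
      type: "input_audio_buffer.append",
      audio: pcm16Chunk.toString("base64"),
    })
  );
}

ws.on("message", (raw) => {
  const event = JSON.parse(raw.toString());
  // Completed transcripts arrive per detected speech segment; in my testing,
  // some segments never produce a completed event, which is where the
  // missing text shows up.
  if (event.type === "conversation.item.input_audio_transcription.completed") {
    console.log("segment:", event.transcript);
  }
});
```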
Despite all this, I still consistently run into the same issue: a high frequency of truncated transcripts.
To clarify:
- The transcriptions I do get are high-quality and accurate.
- But large parts of the audio are simply missing from the final transcript.
- This occurs both for short clips (2–3 minutes) and for longer conversations.
- I use this in my work to transcribe real-time conversations between two people, so completeness is essential.
I’ve searched extensively online, including this forum, Reddit, GitHub, and developer blogs, but I haven’t found anyone who explicitly claims to have solved this issue 100%—as in, no truncation, ever, under realistic usage conditions.
So my question is: has anyone here successfully built a system using `gpt-4o-transcribe` (especially over WebSocket in real time) that consistently avoids truncation and always returns complete transcripts?
If so, I would deeply appreciate:
- A link to working code or an open-source repo
- Any insight into what might be causing the truncation
Thanks in advance to anyone who can help point me in the right direction. This has become a major blocker for real-world use, and it would be great to hear from someone who has managed to overcome it.