Persistent Truncation Issues with GPT-4o-Transcribe – Has Anyone Fully Solved This?

Hi everyone,

I’ve spent a lot of time trying to build a reliable speech-to-text pipeline using OpenAI’s transcription models—both through the /v1/audio/transcriptions endpoint and the new real-time /v1/realtime WebSocket API (using the gpt-4o-transcribe model). I’ve tested this through a custom browser-based web app with a direct WebSocket connection and a range of variations, including different chunk sizes, VAD settings, and silence durations.

Despite all this, I still consistently run into the same issue: a high frequency of truncated transcripts.

To clarify:

  • The transcriptions I do get are high-quality and accurate.
  • But large parts of the audio are simply missing from the final transcript.
  • This occurs both for short clips (2–3 minutes) and for longer conversations.
  • I use this in my work to transcribe real-time conversations between two people, so completeness is essential.

I’ve searched extensively online, including this forum, Reddit, GitHub, and developer blogs, but I haven’t found anyone who explicitly claims to have solved this issue 100%—as in, no truncation, ever, under realistic usage conditions.

So my question is:
Has anyone here successfully built a system using gpt-4o-transcribe (especially over WebSocket in real-time) that consistently avoids truncation and always returns complete transcripts?

If so, I would deeply appreciate:

  • A link to working code or an open-source repo
  • Any insight into what might be causing the truncation

Thanks in advance to anyone who can help point me in the right direction. This has become a major blocker for real-world use, and it would be great to hear from someone who has managed to overcome it.

3 Likes

Unfortunately this is a known issue; so far there is no solution that I’ve heard of.

When I have sensitive content, I’m still using the whisper-1 model, which is slightly inferior but more resilient to truncation issues.

1 Like

I feel it’s a bit weird. I haven’t seen any official statement from OpenAI commenting on the incomplete output of gpt-4o-transcribe. So far I have seen absolutely zero people online who have managed to get complete, non-truncated transcripts using gpt-4o-transcribe. The model came out a good few months ago now, and I haven’t seen any solution.

So yeah, I think I’ve found a solution to the truncation problem with the GPT-4o-transcribe model from OpenAI. Either my solution is the reason transcripts are no longer getting truncated, or OpenAI has made some changes to the model that fixed the issue. However, since I haven’t seen any official statement or announcement from OpenAI about this, I’m assuming the model is still the same as it was when it was released in March.

Anyway, I want to show you the simple solution I discovered. Previously, whenever I used GPT-4o-transcribe, I didn’t supply any prompt in the API call—I just used the default settings. But this time, I tried adding a prompt that specifically told the model not to truncate, omit, summarize, or clean up anything, and to transcribe every spoken word as accurately as possible.

After I did this, I noticed the truncation issue disappeared. The only downside was that the transcripts could be a little disorganized. To improve this, I set the temperature to 0.2. Since then, I’ve been getting really high-quality transcripts—honestly, they beat Whisper 1 almost every time.

I’ve done a lot of tests and experiments. I use this API for client consultations at work, and I’ve run both Whisper 1 and GPT-4o-transcribe in parallel. After these changes with the prompt and temperature, about 99% of the time, the GPT-4o-transcribe transcripts are much better quality than Whisper 1, and I no longer get incomplete or truncated transcripts.

Has anyone else tried this? Or were you aware of this workaround? Maybe give it a try and see if you get the same results. For a long time, I stuck with Whisper 1 because GPT-4o-transcribe had these issues, but now, with this custom prompt and temperature tweak, the problem seems totally solved on my end.

I’ll share the exact prompt I’m using below so you can try it out too. Let me know if it works for you or if you have any feedback!

"formData.append(“model”, “gpt-4o-transcribe”);
formData.append(“temperature”, “0.2”);
formData.append(“prompt”,
“Only transcribe spoken words; exclude all non-verbal and background noises.” +
“Do NOT omit, summarize, or “clean up” anything related to spoken words. " +
“Output every word as spoken. Do NOT truncate or leave out anything in the transcript, that is spoken”
);”
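
For completeness, this is roughly how that formData gets sent. It’s just a sketch, not my exact production code: audioBlob and OPENAI_API_KEY are placeholders, and in a real app the key should live on a backend rather than in the browser.

// Sketch of the surrounding request (inside an async function).
formData.append("file", audioBlob, "chunk.webm"); // the recorded audio chunk

const response = await fetch("https://api.openai.com/v1/audio/transcriptions", {
  method: "POST",
  headers: { Authorization: `Bearer ${OPENAI_API_KEY}` }, // placeholder key handling
  body: formData,
});
const { text } = await response.json(); // the transcript for this chunk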

1 Like

@scott_no35 @aprendendo.next Interesting that you used the formData prompt option.

A few months ago, there were severe bugs in the TTS API (“gpt-4o-mini-tts”): sections of missing speech, volume fade-outs, slurred speech, other stuff… This obviously had a negative impact if fed back into STT. There were a bunch of threads complaining about the issues.

Last week, we conducted a series of tests (over a dozen) for TTS (“gpt-4o-mini-tts”) and STT (“gpt-4o-transcribe”) in different languages. Now, all the issues appear to be resolved - EXCEPT that the resulting transcripts were not properly formatted - just one big paragraph.

So, we are now passing transcription responses through an additional API call (“gpt-4.1”):

Developer Prompt:

Identity

You are a language expert. You specialize in formatting the unformatted text of any language.

Instructions

  • Determine logical paragraphs and separate them with blank lines.
  • If a paragraph has a heading, insert a blank line between the heading and the paragraph.
  • If there is a title, insert a blank line after it.
  • Ensure that statements that need to be quoted are, in fact, quoted.
  • Ensure that the text is properly punctuated using the punctuation and grammatical rules of the language.

User Prompt:

Format the following text: The transcription response goes here

The results are outstanding, but expensive because of the additional API call.
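
In case it helps anyone reproduce this, here is a rough sketch of that formatting call against the Chat Completions endpoint. The developer prompt is condensed from the instructions above; rawTranscript, formatTranscript, and the key handling are placeholders, not our exact code.

async function formatTranscript(rawTranscript) {
  // Second-pass call to gpt-4.1 that only reformats the transcript text.
  const response = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${OPENAI_API_KEY}`, // placeholder key handling
    },
    body: JSON.stringify({
      model: "gpt-4.1",
      messages: [
        {
          // The "Developer Prompt" from above, condensed; "system" is the
          // equivalent role in the Chat Completions API.
          role: "system",
          content:
            "You are a language expert. You specialize in formatting the unformatted text of any language. " +
            "Determine logical paragraphs and separate them with blank lines. " +
            "Insert a blank line after titles and between headings and paragraphs. " +
            "Quote statements that need to be quoted, and punctuate the text using the rules of its language.",
        },
        { role: "user", content: `Format the following text: ${rawTranscript}` },
      ],
    }),
  });
  const data = await response.json();
  return data.choices[0].message.content; // the formatted transcript
}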

1 Like

I assume that the shorter the audio chunks you send for transcription, the more expensive it becomes, since the prompt needs to be included with each chunk. For example, if you’re working with 3–5 second chunks, the prompt is repeated frequently, which adds up in cost.

In my case, I split recordings into two-minute chunks. The prompt I’m using (from my previous message) isn’t very long, so including it with each two-minute chunk shouldn’t significantly increase the cost.
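
To put rough numbers on it, assuming the prompt is somewhere around 50 tokens (an estimate, not a measurement):

20 minutes at 2-minute chunks = 10 chunks → about 10 × 50 ≈ 500 extra prompt tokens
20 minutes at 4-second chunks = 300 chunks → about 300 × 50 ≈ 15,000 extra prompt tokens

So with two-minute chunks the prompt overhead really is negligible.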

Do you think OpenAI has made recent changes to the GPT-4o-transcribe model that resolved the truncation issue? Or do you think adding the prompt is what actually fixed it?

Ah, I forgot that you are doing it with the realtime WebSocket API - totally different animal… We are creating 3-15 minute MP3s without chunking which allows us to pass the transcription response to the formatting call.

Since you are creating chunks, I don’t know if the truncations have been fixed. That said, I think your formData prompt is a great idea.

I’m not using the realtime API. Recording chunks are sent to the transcription endpoint at OpenAI.

gpt-4o-transcribe was basically completely useless for me before. For the last few months I have still just been using whisper-1. However, since I added that prompt with gpt-4o-transcribe, I’ve had zero truncation.

Is there a reason why you are sending chunks? You can transcribe a recording up to about 15 minutes long with no issue - you would have to chunk anything much over 15 minutes. To be honest, I have NO idea how to effectively chunk a recording, so my hat is off to you. 🙂

In a perfect world, OpenAI should allow TTS for entire documents with no limits. And also allow STT with no limits (i.e. no chunking) since MP3s can be very large.

The reason I’m sending two-minute chunks is because I’m using speech-to-text for my job, which involves client consultations with a lot of back-and-forth conversation. I handle many consultations each day, and each one typically lasts around 15–20 minutes. Since I need to stay on schedule, I want the transcript ready as quickly as possible so I can make my notes immediately after each session.

To speed up the process, I split the continuous audio into two-minute segments. As soon as a two-minute mark is reached, that segment is sent for transcription, and the result appears automatically in my transcription output field. This approach means that by the time I finish a consultation—say, at the 20-minute mark—about 80 to 90% of the transcription is already completed. In contrast, sending the entire 20-minute audio file at once would result in a longer wait time before I receive the full transcript.

Another reason I prefer using two-minute chunks instead of a real-time transcription API is that each chunk provides the model with more context. Since the model receives a full two minutes of audio at once, it can better understand the flow of conversation and speaker intent, which leads to higher transcription accuracy. Real-time transcription, on the other hand, typically works with much shorter buffers and less overall context, which can result in lower-quality output.
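
For anyone wondering how the chunking itself can work in the browser, here is a rough sketch of the approach, not my exact code: restart MediaRecorder every two minutes so each blob is a complete, standalone file, then hand it to the upload call (sendForTranscription below is a placeholder for the transcription request shown earlier). One caveat: a word spoken exactly at a chunk boundary can get split across two chunks.

const CHUNK_MS = 2 * 60 * 1000; // two-minute segments

async function startChunkedRecording() {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });

  const startSegment = () => {
    // mimeType support varies by browser; "audio/webm" works in Chrome and Firefox.
    const recorder = new MediaRecorder(stream, { mimeType: "audio/webm" });
    const parts = [];
    recorder.ondataavailable = (e) => parts.push(e.data);
    recorder.onstop = () => {
      // Each segment is a self-contained file, ready to upload immediately.
      const blob = new Blob(parts, { type: "audio/webm" });
      sendForTranscription(blob); // placeholder: POST to /v1/audio/transcriptions
    };
    recorder.start();
    // After two minutes, close this segment and immediately start the next one.
    // (Logic for ending the whole consultation is omitted for brevity.)
    setTimeout(() => {
      recorder.stop();
      startSegment();
    }, CHUNK_MS);
  };

  startSegment();
}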

1 Like

Your use case is probably one of the most important and widely used cases for STT.

1 Like