We are testing GPT-4o-transcribe with websocket realtime transcription. Recognition is fine, but the conversation.item.input_audio_transcription.delta events are all received only at the end of the turn, as a batch of events delivered all at once.
Is it possible to configure the transcription so that we receive the deltas on an ongoing basis?
Could turn detection, or some other configuration, cause this “queuing” of messages?
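For reference, our setup is roughly the sketch below: a transcription_session.update targeting gpt-4o-transcribe with server-side turn detection. This is a minimal sketch, not our exact configuration, and the field values shown are illustrative.

```python
import json

# Rough sketch of the transcription session configuration in question;
# values are illustrative, not our exact settings.
session_update = {
    "type": "transcription_session.update",
    "session": {
        "input_audio_format": "pcm16",
        "input_audio_transcription": {
            "model": "gpt-4o-transcribe",
            "language": "en",
        },
        # Server-side turn detection: the commit is generated automatically
        # when the VAD detects the end of speech.
        "turn_detection": {
            "type": "server_vad",
            "silence_duration_ms": 500,
        },
    },
}

# ws is an already-open websocket to the realtime transcription endpoint,
# e.g. wss://api.openai.com/v1/realtime?intent=transcription
# await ws.send(json.dumps(session_update))
```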
Unfortunately, delta messages are not of much use with transcription. The endpoint starts the transcription only when it receives the commit (either manual, or auto-generated in the case of server VAD or semantic VAD). When it receives the commit, it transcribes the audio buffer in its entirety, so you get the completed message as well as all of the delta messages at the same time.
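You can check this yourself by timestamping the incoming events: with gpt-4o-transcribe, the deltas and the completed event for a turn show up within milliseconds of each other, right after the commit. A minimal sketch, assuming an already-open websocket `ws` that yields JSON event frames; the event type and field names are the documented ones.

```python
import json
import time

async def log_transcription_events(ws):
    # Print the arrival time of each transcription event to confirm that the
    # deltas and the completed transcript arrive together, after the commit.
    async for raw in ws:
        event = json.loads(raw)
        etype = event.get("type", "")
        if etype == "conversation.item.input_audio_transcription.delta":
            print(f"{time.monotonic():.3f} delta: {event.get('delta', '')!r}")
        elif etype == "conversation.item.input_audio_transcription.completed":
            print(f"{time.monotonic():.3f} completed: {event.get('transcript', '')!r}")
```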
Hey Mathi, thanks for the feedback. If the transcription is done all at once during the commit, what is the purpose of the delta messages?
We are currently using “delta” messages from AssemblyAI and Azure Speech to Text to indicate to the user when speech is being detected. Is there a way to receive some realtime data that we can use to trigger an animation whenever speech is detected?
The way delta messages currently work, I don’t think they are all that useful for transcription. They’re likely keeping the same API interaction pattern as the other realtime endpoints (text, audio, etc.), where delta messages probably do arrive earlier.
Facing the same issue. Everything described above by Mathi is correct. A somewhat hacky solution is described below.
Setup:
OpenAI realtime with Azure (v1 endpoint), targeting gpt-4o-transcribe.
No server_vad enabled, as I have my own VAD (SileroV5). As mentioned, I must send “commit” messages myself.
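For context, with server_vad disabled the client has to drive the input buffer itself. Roughly, that looks like the sketch below; it uses the standard input_audio_buffer.append / input_audio_buffer.commit client events, and the helper names are my own.

```python
import base64
import json

async def disable_server_vad(ws):
    # Turn off server-side turn detection; commits must then come from the client.
    await ws.send(json.dumps({
        "type": "transcription_session.update",
        "session": {
            "input_audio_transcription": {"model": "gpt-4o-transcribe"},
            "turn_detection": None,
        },
    }))

async def append_audio(ws, pcm16_bytes: bytes):
    # Append a chunk of raw PCM16 audio (e.g. from my SileroV5 VAD pipeline).
    await ws.send(json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm16_bytes).decode("ascii"),
    }))

async def commit_audio(ws):
    # Transcription of the buffered audio only starts once this is sent.
    await ws.send(json.dumps({"type": "input_audio_buffer.commit"}))
```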
Problem:
Deltas are currently irrelevant. GPT-4o-transcribe, even with the Realtime endpoint, does not start transcription before the commit message is sent. Therefore, speech with no long pauses (that is, speech where my VAD does not detect a turn change) results in transcription query times exceeding 5 s (bad for real-time use cases).
I believe the same applies when using the server_vad option: the server’s VAD does not recognize a turn change, and thus no automatic commit is sent before the audio buffer becomes large. [Correct me if I’m wrong on this one, I have not tested this exact case.]
Depending on the use case, this scenario of a user rambling for long periods of time may be infrequent, so it might not be a problem for most developers.
Hacky solution:
Send intermediate commits. If a user has been rambling for more than X seconds (no breaks in speech), send the commit message, restart the transcription, and repeat (see the sketch at the end of this post).
Moreover, the prompt can be changed during these stops and starts to help guide the transcription (with the previous transcriptions). Obviously, this will affect word error rates, but that is the compromise for the increased speed.
Finally, selecting X = 45 seconds seems to keep the processing time below 2 s.
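For illustration, the intermediate-commit logic is roughly the sketch below. The class and helper names are my own, and the prompt carry-over is the compromise described above: it forces a commit once uninterrupted speech exceeds X seconds and seeds the next segment’s prompt with what has already been transcribed.

```python
import json
import time

MAX_SEGMENT_SECONDS = 45  # the "X" above; observed to keep latency below ~2 s

class IntermediateCommitter:
    """Force a commit when uninterrupted speech runs longer than X seconds."""

    def __init__(self, ws):
        self.ws = ws
        self.segment_start = None
        self.previous_transcript = ""

    async def on_speech_chunk(self):
        # Called for every audio chunk while the local VAD reports ongoing
        # speech, after the chunk has been appended to the input buffer.
        now = time.monotonic()
        if self.segment_start is None:
            self.segment_start = now
        if now - self.segment_start > MAX_SEGMENT_SECONDS:
            await self._force_commit()

    async def on_transcript(self, transcript: str):
        # Called from the transcription.completed handler; kept to seed the prompt.
        self.previous_transcript += " " + transcript

    async def _force_commit(self):
        # Commit mid-utterance so transcription starts without waiting for a pause.
        await self.ws.send(json.dumps({"type": "input_audio_buffer.commit"}))
        # Guide the next segment with what has already been transcribed
        # (trades some accuracy for latency, as noted above).
        await self.ws.send(json.dumps({
            "type": "transcription_session.update",
            "session": {
                "input_audio_transcription": {
                    "model": "gpt-4o-transcribe",
                    "prompt": self.previous_transcript[-800:],
                },
            },
        }))
        self.segment_start = None
```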