Realtime transcription message flow is wrong

I’m trying out the new audio transcription with the Realtime API, but I’m experiencing something odd: the order of the messages seems to be wrong.
I’m getting all the deltas at once, only after the person stops talking.
This is what I’m getting:

I start the session:
{"type":"transcription_session.update","session":{"input_audio_format":"pcm16","input_audio_transcription":{"model":"gpt-4o-transcribe"},"turn_detection":{"type":"server_vad","threshold":0.6,"prefix_padding_ms":300,"silence_duration_ms":650},"input_audio_noise_reduction":{"type":"near_field"}}}

get the replies:

{"type":"transcription_session.created",…}
{"type":"transcription_session.updated",…}
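
For anyone who wants to reproduce the setup, here is a minimal Python sketch that sends the same transcription_session.update (just a sketch, assuming the websockets library v14+ and a standard server-side API key; the headers mirror the code posted further down the thread, and start_session is a hypothetical helper, not my actual client):

import asyncio
import json
import os

import websockets


async def start_session():
    headers = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",
    }
    async with websockets.connect(
        "wss://api.openai.com/v1/realtime", additional_headers=headers
    ) as ws:
        # Same settings as the session.update shown above.
        await ws.send(json.dumps({
            "type": "transcription_session.update",
            "session": {
                "input_audio_format": "pcm16",
                "input_audio_transcription": {"model": "gpt-4o-transcribe"},
                "turn_detection": {
                    "type": "server_vad",
                    "threshold": 0.6,
                    "prefix_padding_ms": 300,
                    "silence_duration_ms": 650,
                },
                "input_audio_noise_reduction": {"type": "near_field"},
            },
        }))
        # The server should answer with transcription_session.created
        # and transcription_session.updated.
        print(await ws.recv())
        print(await ws.recv())


asyncio.run(start_session())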

I start talking

{"type":"input_audio_buffer.speech_started","event_id":"event_BDUuWNyr4b7NQ3SIq2V5Y","audio_start_ms":3892,"item_id":"item_BDUuWWq02Qttz8THu5Y1E"}
{"type":"input_audio_buffer.speech_stopped","event_id":"event_BDUuadSIQY2ZTZ7RljCiJ","audio_end_ms":7136,"item_id":"item_BDUuWWq02Qttz8THu5Y1E"}

Note that more than 3 seconds elapsed between speech_started and speech_stopped (audio_end_ms - audio_start_ms = 7136 - 3892 = 3244 ms).

Then I get:

{"type":"input_audio_buffer.committed",…}
{"type":"conversation.item.created",…}
{"type":"conversation.item.input_audio_transcription.delta",…}
{"type":"conversation.item.input_audio_transcription.delta",…}
{"type":"conversation.item.input_audio_transcription.delta",…}
{"type":"conversation.item.input_audio_transcription.delta",…}
{"type":"conversation.item.input_audio_transcription.delta",…}
{"type":"conversation.item.input_audio_transcription.delta",…}

and finally
{"type":"conversation.item.input_audio_transcription.completed",…}

After the input_audio_buffer.speech_stopped message I get all the transcription deltas and the completed message at once. I would expect the delta messages to appear between the speech_started and speech_stopped messages. I noticed similar behavior in the Playground while testing the Realtime API. Is this the expected behavior? I’d expect to receive the delta messages while the person is talking, not afterwards; otherwise the deltas kind of miss their purpose.
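
One way to see when the deltas actually arrive is to stamp each incoming event with a wall-clock time and compare it against speech_started / speech_stopped. A rough sketch of such a handler (log_event_timing is a hypothetical helper; it plugs into an open websockets connection like the one in the sketch above):

import json
import time


async def log_event_timing(ws) -> None:
    # Print the arrival time of each relevant event so the delta timing
    # can be compared against speech_started / speech_stopped.
    t0 = time.monotonic()
    interesting = {
        "input_audio_buffer.speech_started",
        "input_audio_buffer.speech_stopped",
        "conversation.item.input_audio_transcription.delta",
        "conversation.item.input_audio_transcription.completed",
    }
    async for raw in ws:
        event = json.loads(raw)
        etype = event.get("type")
        if etype in interesting:
            print(f"{time.monotonic() - t0:7.3f}s  {etype}")
        if etype == "conversation.item.input_audio_transcription.completed":
            break

If the deltas were really streamed, their printed times would spread across the 3+ seconds of speech instead of all landing right after speech_stopped.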

Anyone else seeing this?


Yes, I’m seeing the same thing, and I’m also surprised. There’s no point in getting all the deltas at the end, at the same time as the conversation.item.input_audio_transcription.completed event.

Hi, here’s my working code, hope it helps!

import os
import json
import base64
import asyncio
import logging
import aiohttp
import websockets
from dotenv import load_dotenv

load_dotenv()
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
if not OPENAI_API_KEY:
    raise ValueError("Missing OpenAI API key.")

logging.basicConfig(level=logging.DEBUG, format="%(asctime)s [%(levelname)s] %(message)s")
logger = logging.getLogger(__name__)

final_transcription = ""

async def create_transcription_session():
    """
    Create a transcription session via the REST API to obtain an ephemeral token.
    This endpoint uses the beta header "OpenAI-Beta: assistants=v2".
    """
    url = "https://api.openai.com/v1/realtime/transcription_sessions"
    payload = {
        "input_audio_format": "g711_ulaw",
        "input_audio_transcription": {
            "model": "gpt-4o-transcribe",
            "language": "en",
            "prompt": "Transcribe the incoming audio in real time."
        },
        "turn_detection": {"type": "server_vad", "silence_duration_ms": 1000}
    }
    headers = {
        "Authorization": f"Bearer {OPENAI_API_KEY}",
        "Content-Type": "application/json",
        "OpenAI-Beta": "assistants=v2"
    }
    async with aiohttp.ClientSession() as session:
        async with session.post(url, json=payload, headers=headers) as resp:
            if resp.status != 200:
                text = await resp.text()
                raise Exception(f"Failed to create transcription session: {resp.status} {text}")
            data = await resp.json()
            ephemeral_token = data["client_secret"]["value"]
            logger.info("Transcription session created; ephemeral token obtained.")
            return ephemeral_token

async def send_audio(ws, file_path: str, chunk_size: int, speech_stopped_event: asyncio.Event):
    """
    Read the local ulaw file and send it in chunks.
    After finishing, wait for 1 second to see if the server auto-commits.
    If not, send a commit event manually.
    """
    try:
        with open(file_path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                # Base64-encode the audio chunk.
                audio_chunk = base64.b64encode(chunk).decode("utf-8")
                audio_event = {
                    "type": "input_audio_buffer.append",
                    "audio": audio_chunk
                }
                await ws.send(json.dumps(audio_event))
                await asyncio.sleep(0.02)  # stream in chunks (faster than real time; see the pacing note after the script)
        logger.info("Finished sending audio file.")

        # Wait 1 second to allow any late VAD events before committing.
        try:
            await asyncio.wait_for(speech_stopped_event.wait(), timeout=1.0)
            logger.debug("Speech stopped event received; no manual commit needed.")
        except asyncio.TimeoutError:
            commit_event = {"type": "input_audio_buffer.commit"}
            await ws.send(json.dumps(commit_event))
            logger.info("Manually sent input_audio_buffer.commit event.")
    except FileNotFoundError:
        logger.error(f"Audio file not found: {file_path}")
    except Exception as e:
        logger.error("Error sending audio: %s", e)

async def receive_events(ws, speech_stopped_event: asyncio.Event):
    """
    Listen for events from the realtime endpoint.
    Capture transcription deltas and the final complete transcription.
    Set the speech_stopped_event when a "speech_stopped" event is received.
    """
    global final_transcription
    try:
        async for message in ws:
            try:
                event = json.loads(message)
                event_type = event.get("type")
                if event_type == "input_audio_buffer.speech_stopped":
                    logger.debug("Received event: input_audio_buffer.speech_stopped")
                    speech_stopped_event.set()
                elif event_type == "conversation.item.input_audio_transcription.delta":
                    delta = event.get("delta", "")
                    logger.info("Transcription delta: %s", delta)
                    final_transcription += delta
                elif event_type == "conversation.item.input_audio_transcription.completed":
                    completed_text = event.get("transcript", "")
                    logger.info("Final transcription completed: %s", completed_text)
                    final_transcription = completed_text  # Use the completed transcript
                    break  # Exit after final transcription
                elif event_type == "error":
                    logger.error("Error event: %s", event.get("error"))
                else:
                    logger.debug("Received event: %s", event_type)
            except Exception as ex:
                logger.error("Error processing message: %s", ex)
    except Exception as e:
        logger.error("Error receiving events: %s", e)

async def test_transcription():
    try:
        # Step 1: Create transcription session and get ephemeral token.
        ephemeral_token = await create_transcription_session()

        # Step 2: Connect to the base realtime endpoint.
        websocket_url = "wss://api.openai.com/v1/realtime"
        connection_headers = {
            "Authorization": f"Bearer {ephemeral_token}",
            "OpenAI-Beta": "realtime=v1"
        }
        async with websockets.connect(websocket_url, additional_headers=connection_headers) as ws:
            logger.info("Connected to realtime endpoint.")

            # Step 3: Send transcription session update event with adjusted VAD settings.
            update_event = {
                "type": "transcription_session.update",
                "session": {
                    "input_audio_transcription": {
                        "model": "gpt-4o-transcribe",
                        "language": "en",
                        "prompt": "Transcribe the incoming audio in real time."
                    },
                    # Matching the REST API settings
                    "turn_detection": {"type": "server_vad", "silence_duration_ms": 1000}
                }
            }
            await ws.send(json.dumps(update_event))
            logger.info("Sent transcription session update event.")

            # Create an event to signal if speech stopped is detected.
            speech_stopped_event = asyncio.Event()

            # Step 4: Run sender and receiver concurrently.
            sender_task = asyncio.create_task(send_audio(ws, "static/Welcome.ulaw", 1024, speech_stopped_event))
            receiver_task = asyncio.create_task(receive_events(ws, speech_stopped_event))
            await asyncio.gather(sender_task, receiver_task)

            # Print the final transcription.
            logger.info("Final complete transcription: %s", final_transcription)
            print("Final complete transcription:")
            print(final_transcription)

    except Exception as e:
        logger.error("Error in transcription test: %s", e)

if __name__ == "__main__":
    asyncio.run(test_transcription())
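
One note on the pacing in send_audio above: 1024 bytes of 8 kHz G.711 µ-law is 1024 / 8000 = 0.128 s of audio, so the 0.02 s sleep pushes the file roughly six times faster than real time. That’s fine for a quick file test, but if you want to mimic a live microphone (which is when the timing of the deltas actually matters), you can sleep for each chunk’s real duration instead. A small sketch of such a helper (realtime_sleep is hypothetical, not part of the script above):

import asyncio


async def realtime_sleep(chunk: bytes, bytes_per_second: int = 8000) -> None:
    # g711_ulaw is 8000 samples per second at 1 byte per sample,
    # so a chunk of len(chunk) bytes covers len(chunk) / 8000 seconds.
    await asyncio.sleep(len(chunk) / bytes_per_second)

Calling await realtime_sleep(chunk) instead of await asyncio.sleep(0.02) inside the send loop paces the upload at real time.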

I’m using Node.js; it all works fine, but the order of the messages is off. If I try the Realtime API through the Playground I get the same behavior: the UI doesn’t update while I’m speaking, and it only updates once I stop talking. The logs show the same thing: the delta messages are only sent after I stop talking.

Any clarification on whether this is a bug or expected behavior? I agree with the OP that receiving the deltas and the completed message at the same time makes the deltas seem pointless.

I’m dealing with this same issue: in my app the delta messages only come in after speech stops. Are we still waiting for a resolution?

Unfortunately I’m starting to believe that this is the expected behavior. It’s exactly what happens when you use Realtime in the Playground: you only get the full transcript once you stop speaking, not while you’re speaking, and the logs also show that the deltas are only transmitted after you’re done speaking. I’m not really sure why they implemented it like this, but it’s what we’ve got right now.