Realtime transcription message flow is wrong

I’m trying out the new audio transcription with the Realtime API, but I’m experiencing something odd: the order of the messages seems to be wrong.
I’m getting all the deltas at once, only after the person stops talking.
This is what I’m getting:

I start the session:
{"type":"transcription_session.update","session":{"input_audio_format":"pcm16","input_audio_transcription":{"model":"gpt-4o-transcribe"},"turn_detection":{"type":"server_vad","threshold":0.6,"prefix_padding_ms":300,"silence_duration_ms":650},"input_audio_noise_reduction":{"type":"near_field"}}}

get the replies:

{"type":"transcription_session.created",…}
{"type":"transcription_session.updated",…}
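
For anyone who wants to reproduce the setup, here is a minimal Python sketch that sends the same transcription_session.update (just a sketch, assuming the websockets library v14+ and a standard server-side API key; the headers mirror the code posted further down the thread, and start_session is a hypothetical helper, not my actual client):

import asyncio
import json
import os

import websockets


async def start_session():
    headers = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",
    }
    async with websockets.connect(
        "wss://api.openai.com/v1/realtime", additional_headers=headers
    ) as ws:
        # Same settings as the session.update shown above.
        await ws.send(json.dumps({
            "type": "transcription_session.update",
            "session": {
                "input_audio_format": "pcm16",
                "input_audio_transcription": {"model": "gpt-4o-transcribe"},
                "turn_detection": {
                    "type": "server_vad",
                    "threshold": 0.6,
                    "prefix_padding_ms": 300,
                    "silence_duration_ms": 650,
                },
                "input_audio_noise_reduction": {"type": "near_field"},
            },
        }))
        # The server should answer with transcription_session.created
        # and transcription_session.updated.
        print(await ws.recv())
        print(await ws.recv())


asyncio.run(start_session())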

I start talking

{"type":"input_audio_buffer.speech_started","event_id":"event_BDUuWNyr4b7NQ3SIq2V5Y","audio_start_ms":3892,"item_id":"item_BDUuWWq02Qttz8THu5Y1E"}
{"type":"input_audio_buffer.speech_stopped","event_id":"event_BDUuadSIQY2ZTZ7RljCiJ","audio_end_ms":7136,"item_id":"item_BDUuWWq02Qttz8THu5Y1E"}

Note that more than 3 seconds elapsed between speech_started and speech_stopped (audio_end_ms - audio_start_ms = 7136 - 3892 = 3244 ms).

Then I get:

{"type":"input_audio_buffer.committed",…}
{"type":"conversation.item.created",…}
{"type":"conversation.item.input_audio_transcription.delta",…}
{"type":"conversation.item.input_audio_transcription.delta",…}
{"type":"conversation.item.input_audio_transcription.delta",…}
{"type":"conversation.item.input_audio_transcription.delta",…}
{"type":"conversation.item.input_audio_transcription.delta",…}
{"type":"conversation.item.input_audio_transcription.delta",…}

and finally
{"type":"conversation.item.input_audio_transcription.completed",…}

After the input_audio_buffer.speech_stopped message I get all the transcription deltas and the completed message at once. I would expect the delta messages to appear between the speech_started and speech_stopped messages. I noticed similar behavior in the Playground while testing the Realtime API. Is this the expected behavior? I’d expect to receive the delta messages while the person is talking, not afterwards; otherwise the deltas kind of miss their purpose.
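
One way to see when the deltas actually arrive is to stamp each incoming event with a wall-clock time and compare it against speech_started / speech_stopped. A rough sketch of such a handler (log_event_timing is a hypothetical helper; it plugs into an open websockets connection like the one in the sketch above):

import json
import time


async def log_event_timing(ws) -> None:
    # Print the arrival time of each relevant event so the delta timing
    # can be compared against speech_started / speech_stopped.
    t0 = time.monotonic()
    interesting = {
        "input_audio_buffer.speech_started",
        "input_audio_buffer.speech_stopped",
        "conversation.item.input_audio_transcription.delta",
        "conversation.item.input_audio_transcription.completed",
    }
    async for raw in ws:
        event = json.loads(raw)
        etype = event.get("type")
        if etype in interesting:
            print(f"{time.monotonic() - t0:7.3f}s  {etype}")
        if etype == "conversation.item.input_audio_transcription.completed":
            break

If the deltas were really streamed, their printed times would spread across the 3+ seconds of speech instead of all landing right after speech_stopped.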

Anyone else seeing this?


Yes, I’m seeing the same thing, and I’m also surprised. There’s no point in getting all the deltas at the end, at the same time as the conversation.item.input_audio_transcription.completed event.

Hi, here’s my working code, hope it helps!

import os
import json
import base64
import asyncio
import logging
import aiohttp
import websockets
from dotenv import load_dotenv

load_dotenv()
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY")
if not OPENAI_API_KEY:
    raise ValueError("Missing OpenAI API key.")

logging.basicConfig(level=logging.DEBUG, format="%(asctime)s [%(levelname)s] %(message)s")
logger = logging.getLogger(__name__)

final_transcription = ""

async def create_transcription_session():
    """
    Create a transcription session via the REST API to obtain an ephemeral token.
    This endpoint uses the beta header "OpenAI-Beta: assistants=v2".
    """
    url = "https://api.openai.com/v1/realtime/transcription_sessions"
    payload = {
        "input_audio_format": "g711_ulaw",
        "input_audio_transcription": {
            "model": "gpt-4o-transcribe",
            "language": "en",
            "prompt": "Transcribe the incoming audio in real time."
        },
        "turn_detection": {"type": "server_vad", "silence_duration_ms": 1000}
    }
    headers = {
        "Authorization": f"Bearer {OPENAI_API_KEY}",
        "Content-Type": "application/json",
        "OpenAI-Beta": "assistants=v2"
    }
    async with aiohttp.ClientSession() as session:
        async with session.post(url, json=payload, headers=headers) as resp:
            if resp.status != 200:
                text = await resp.text()
                raise Exception(f"Failed to create transcription session: {resp.status} {text}")
            data = await resp.json()
            ephemeral_token = data["client_secret"]["value"]
            logger.info("Transcription session created; ephemeral token obtained.")
            return ephemeral_token

async def send_audio(ws, file_path: str, chunk_size: int, speech_stopped_event: asyncio.Event):
    """
    Read the local ulaw file and send it in chunks.
    After finishing, wait for 1 second to see if the server auto-commits.
    If not, send a commit event manually.
    """
    try:
        with open(file_path, "rb") as f:
            while True:
                chunk = f.read(chunk_size)
                if not chunk:
                    break
                # Base64-encode the audio chunk.
                audio_chunk = base64.b64encode(chunk).decode("utf-8")
                audio_event = {
                    "type": "input_audio_buffer.append",
                    "audio": audio_chunk
                }
                await ws.send(json.dumps(audio_event))
                await asyncio.sleep(0.02)  # stream in chunks (faster than real time; see the pacing note after the script)
        logger.info("Finished sending audio file.")

        # Wait 1 second to allow any late VAD events before committing.
        try:
            await asyncio.wait_for(speech_stopped_event.wait(), timeout=1.0)
            logger.debug("Speech stopped event received; no manual commit needed.")
        except asyncio.TimeoutError:
            commit_event = {"type": "input_audio_buffer.commit"}
            await ws.send(json.dumps(commit_event))
            logger.info("Manually sent input_audio_buffer.commit event.")
    except FileNotFoundError:
        logger.error(f"Audio file not found: {file_path}")
    except Exception as e:
        logger.error("Error sending audio: %s", e)

async def receive_events(ws, speech_stopped_event: asyncio.Event):
    """
    Listen for events from the realtime endpoint.
    Capture transcription deltas and the final complete transcription.
    Set the speech_stopped_event when a "speech_stopped" event is received.
    """
    global final_transcription
    try:
        async for message in ws:
            try:
                event = json.loads(message)
                event_type = event.get("type")
                if event_type == "input_audio_buffer.speech_stopped":
                    logger.debug("Received event: input_audio_buffer.speech_stopped")
                    speech_stopped_event.set()
                elif event_type == "conversation.item.input_audio_transcription.delta":
                    delta = event.get("delta", "")
                    logger.info("Transcription delta: %s", delta)
                    final_transcription += delta
                elif event_type == "conversation.item.input_audio_transcription.completed":
                    completed_text = event.get("transcript", "")
                    logger.info("Final transcription completed: %s", completed_text)
                    final_transcription = completed_text  # Use the completed transcript
                    break  # Exit after final transcription
                elif event_type == "error":
                    logger.error("Error event: %s", event.get("error"))
                else:
                    logger.debug("Received event: %s", event_type)
            except Exception as ex:
                logger.error("Error processing message: %s", ex)
    except Exception as e:
        logger.error("Error receiving events: %s", e)

async def test_transcription():
    try:
        # Step 1: Create transcription session and get ephemeral token.
        ephemeral_token = await create_transcription_session()

        # Step 2: Connect to the base realtime endpoint.
        websocket_url = "wss://api.openai.com/v1/realtime"
        connection_headers = {
            "Authorization": f"Bearer {ephemeral_token}",
            "OpenAI-Beta": "realtime=v1"
        }
        async with websockets.connect(websocket_url, additional_headers=connection_headers) as ws:
            logger.info("Connected to realtime endpoint.")

            # Step 3: Send transcription session update event with adjusted VAD settings.
            update_event = {
                "type": "transcription_session.update",
                "session": {
                    "input_audio_transcription": {
                        "model": "gpt-4o-transcribe",
                        "language": "en",
                        "prompt": "Transcribe the incoming audio in real time."
                    },
                    # Matching the REST API settings
                    "turn_detection": {"type": "server_vad", "silence_duration_ms": 1000}
                }
            }
            await ws.send(json.dumps(update_event))
            logger.info("Sent transcription session update event.")

            # Create an event to signal if speech stopped is detected.
            speech_stopped_event = asyncio.Event()

            # Step 4: Run sender and receiver concurrently.
            sender_task = asyncio.create_task(send_audio(ws, "static/Welcome.ulaw", 1024, speech_stopped_event))
            receiver_task = asyncio.create_task(receive_events(ws, speech_stopped_event))
            await asyncio.gather(sender_task, receiver_task)

            # Print the final transcription.
            logger.info("Final complete transcription: %s", final_transcription)
            print("Final complete transcription:")
            print(final_transcription)

    except Exception as e:
        logger.error("Error in transcription test: %s", e)

if __name__ == "__main__":
    asyncio.run(test_transcription())
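
One note on the pacing in send_audio above: 1024 bytes of 8 kHz G.711 µ-law is 1024 / 8000 = 0.128 s of audio, so the 0.02 s sleep pushes the file roughly six times faster than real time. That’s fine for a quick file test, but if you want to mimic a live microphone (which is when the timing of the deltas actually matters), you can sleep for each chunk’s real duration instead. A small sketch of such a helper (realtime_sleep is hypothetical, not part of the script above):

import asyncio


async def realtime_sleep(chunk: bytes, bytes_per_second: int = 8000) -> None:
    # g711_ulaw is 8000 samples per second at 1 byte per sample,
    # so a chunk of len(chunk) bytes covers len(chunk) / 8000 seconds.
    await asyncio.sleep(len(chunk) / bytes_per_second)

Calling await realtime_sleep(chunk) instead of await asyncio.sleep(0.02) inside the send loop paces the upload at real time.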

I’m using Node.js; it all works fine, but the order of the messages is off. If I try the Realtime API through the Playground I get the same behavior: the UI doesn’t update while I’m speaking, and it only updates once I stop talking. The logs show the same thing: the delta messages are only sent after I stop talking.

Any clarification on whether this is a bug or expected behavior? I agree with the OP that receiving the deltas and the completed message at the same time makes the deltas seem pointless.

I’m dealing with this same issue: in my app the delta messages only come in after speech stops. Are we still waiting for a resolution?

Unfortunately I’m starting to believe that this is the expected behavior. It’s exactly what happens when you use Realtime in the Playground: you only get the full transcript once you stop speaking, not while you’re speaking, and the logs also show that the deltas are only transmitted after you’re done speaking. I’m not really sure why they implemented it like this, but it’s what we’ve got right now.