Realtime Transcription mode Leaking System Prompt


I’m experiencing a bug when using transcription mode on the Realtime API.
Occasionally I receive a conversation.item.input_audio_transcription.completed server event whose transcript contains part, or sometimes all, of the system prompt that was sent when the ephemeral key was created.

I can reproduce this most reliably after long pauses in the audio input stream, though I’ve also observed it during otherwise normal sessions.

Has anyone else encountered a similar issue?


Current Setup

I’m creating my Realtime Agent and session via the Agents SDK (JavaScript), with a Python sideband server maintaining a websocket connection to the session and receiving the events.
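For reference, the gist of the sideband listener’s event handling looks like this (sketched in JavaScript here for consistency with the rest of the post; my actual sideband server is Python). The event names are the ones I actually receive from the API:

```javascript
// Route incoming Realtime API server events by type. The leaked prompt text
// shows up in the "completed" transcription event's transcript field.
function routeServerEvent(event) {
  switch (event.type) {
    case "conversation.item.input_audio_transcription.completed":
      return { kind: "transcript", text: event.transcript };
    case "conversation.item.input_audio_transcription.delta":
      return { kind: "partial", text: event.delta };
    case "error":
      return { kind: "error", text: event.error?.message ?? "unknown" };
    default:
      // Session lifecycle, rate-limit, etc. events are ignored by the listener.
      return { kind: "ignored", text: "" };
  }
}
```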


Session Object (Ephemeral Key)

{
  "value": "ek_68*****************",
  "expires_at": 1760438160,
  "session": {
    "type": "transcription",
    "object": "realtime.transcription_session",
    "id": "sess_CQWW8DqHF1xWB0tImgwBY",
    "expires_at": 0,
    "audio": {
      "input": {
        "format": {
          "type": "audio/pcm",
          "rate": 24000
        },
        "transcription": {
          "model": "gpt-4o-mini-transcribe",
          "language": "en",
          "prompt": "# Car Insurance Call Transcriber – Starting Instructions\n\n## Persona\nYou are an AI transcriber whose sole task is to produce an accurate, verbatim transcript of a spoken call between a human customer and a human insurance handler.\nYou will only be listening to one side of the conversation, either the customer or human insurance handler. You will not hear both parties. \nAll utterances are in British English and pertain to car insurance.\n\n## Goals\n1. Capture every word, pause and filler (“erm”, “uh”, etc.) exactly as spoken.\n2. Preserve insurance-specific terminology (e.g. “comprehensive cover”, “no-claims bonus”, “excess”, “third-party liability”).\n3. Retain any relevant background noises or overlaps only if they’re audible and affect understanding.\n4. Do not summarise, paraphrase or correct grammar—transcribe strictly.\n\n## Transcription Corrections\nBelow is a list of transcriptions that you got wrong and have been updated by the human insurance handler. Use this information to not make the same mistakes.\n\n"
        },
        "noise_reduction": {
          "type": "far_field"
        },
        "turn_detection": {
          "type": "server_vad",
          "threshold": 0.7,
          "prefix_padding_ms": 300,
          "silence_duration_ms": 500,
          "idle_timeout_ms": null
        }
      }
    },
    "include": [
      "item.input_audio_transcription.logprobs"
    ]
  }
}

Agents SDK (JavaScript)

// Import path as I have it in my project; adjust if your package layout differs.
import { RealtimeAgent, RealtimeSession, OpenAIRealtimeWebRTC } from "@openai/agents/realtime";

const handlerTransport = new OpenAIRealtimeWebRTC({
  mediaStream: handlerDest.stream,
  audioElement: handlerAudioEl,
});

const transcriptionAgent = new RealtimeAgent({
  name: "Transcription Agent",
  instructions: prompt, // same prompt as in the session object above
});

const handler = new RealtimeSession(transcriptionAgent, {
  transport: handlerTransport,
});

await handler.connect({ apiKey: handlerEphemeralObj.value });

Example of Incorrect Transcription Event

Sometimes only part of my prompt appears in the transcription output:

{
  "type": "conversation.item.input_audio_transcription.completed",
  "event_id": "event_CQWXSAHeIQZO0WmhP6PN8",
  "item_id": "item_CQWXRfPEVcd6xvFrezgWB",
  "content_index": 0,
  "transcript": "You are an AI transcriber whose sole task is to produce an accurate, verbatim transcript of a spoken call between a human customer and a human insurance handler.",
  "logprobs": [
    { "token": "You", "logprob": -1.017870306968689, "bytes": [89,111,117] },
    { "token": " are", "logprob": -3.0639405250549316, "bytes": [32,97,114,101] }
    // ... truncated for brevity
  ],
  "usage": {
    "type": "tokens",
    "total_tokens": 257,
    "input_tokens": 223,
    "input_token_details": {
      "text_tokens": 215,
      "audio_tokens": 8
    },
    "output_tokens": 34
  }
}

Summary

  • Issue: Prompt text leaking into transcriptions.
  • Conditions: Occurs after long silences or intermittently during active sessions.
  • Setup: JS Realtime Agent + Python sideband websocket listener.
  • Model: gpt-4o-mini-transcribe
  • Session Type: transcription
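As a stopgap, I’m considering dropping completed events whose transcript is just my prompt text echoed back. A minimal sketch of that guard (the normalization and the length threshold are my own guesses, not anything from the API):

```javascript
// Heuristic guard: treat a completed transcript as a prompt leak if, after
// normalizing whitespace and case, it appears verbatim inside the session's
// transcription prompt. Purely client-side; doesn't fix the underlying bug.
function normalize(text) {
  return text.toLowerCase().replace(/\s+/g, " ").trim();
}

function isPromptLeak(transcript, prompt) {
  const t = normalize(transcript);
  // Ignore very short transcripts ("yes", "okay") that could collide by chance.
  if (t.length < 20) return false;
  return normalize(prompt).includes(t);
}
```

With this in place the sideband can skip storing any completed event where isPromptLeak(event.transcript, prompt) is true, at the cost of occasionally dropping a genuine utterance that happens to quote the prompt.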

Update: Additional Session Configurations Tested

I’ve experimented with several different turn_detection configurations to see if the issue persists under alternate setups.
Unfortunately, the same prompt leakage occasionally occurs across all of the following variations:


1. Server VAD (No Response Creation)

"turn_detection": {
  "type": "server_vad",
  "create_response": false,
  "threshold": 0.7,
  "prefix_padding_ms": 300,
  "silence_duration_ms": 500,
  "idle_timeout_ms": null,
  "interrupt_response": false
}

2. Semantic VAD

"turn_detection": {
  "type": "semantic_vad",
  "create_response": false,
  "eagerness": "low",
  "interrupt_response": false
}

3. Semantic VAD (with Response Creation)

"turn_detection": {
  "type": "semantic_vad",
  "create_response": true,
  "eagerness": "low",
  "interrupt_response": true
}

@juberti, tagging you here to get your input/visibility on this. Thanks.

That is a somewhat long prompt. If you make the prompt shorter, does that help?

Hi, thanks for the suggestion. I shortened my prompt to “All utterances are in British English and pertain to car insurance.” but eventually the API still transcribed it:

{
  "type": "conversation.item.input_audio_transcription.completed",
  "event_id": "event_CREItiCjnfzPaDNrxXja5",
  "item_id": "item_CREIqqydUOhfMM0T7tmao",
  "content_index": 0,
  "transcript": "All utterances are in British English and pertain to car insurance.",
  "logprobs": [ ...
