Realtime API - Unintended Speech Control with Sequential response.create

Environment:

  • OpenAI Realtime API (gpt-realtime)

  • Twilio Media Streams integration

  • WebSocket connection (g711_ulaw audio format)

Issue:

In a Twilio phone call, I want my AI agent to say “Hello” → “Introduction + request details”. What it actually says is “Hello” → “Irrelevant content”, and the correct content only comes after the user speaks.

// Session configuration  
const sessionConfig = {
  type: 'session.update',
  session: {
    modalities: ['text', 'audio'],
    instructions: `Phone agent. Start with "Hello" then immediately introduce and explain request.`,
    voice: 'alloy',
    input_audio_format: 'g711_ulaw',
    output_audio_format: 'g711_ulaw',
    turn_detection: {
      type: 'server_vad',
      threshold: 0.5,
      prefix_padding_ms: 300,
      silence_duration_ms: 500,
      interrupt_response: true
    },
    temperature: 0.7,
    max_response_output_tokens: 4096
  }
};

// Staged speech control
setTimeout(() => {
  const greetOnly = {
    type: 'response.create',
    response: {
      conversation: 'none',
      modalities: ['text', 'audio'],
      instructions: 'Say only "Hello." within 0.5 seconds. Stay silent otherwise.'
    }
  };
  openAIWs.send(JSON.stringify(greetOnly));
}, 800);

setTimeout(() => {
  const startRequest = {
    type: 'response.create', 
    response: {
      modalities: ['text', 'audio'],
      instructions: 'Say "Hello, I am calling on behalf of..." then explain the request.'
    }
  };
  openAIWs.send(JSON.stringify(startRequest));
}, 1800);

Expected: Hello → (1sec gap) → Hello, I am calling on behalf of…

Actual: Hello → Irrelevant questions/speech → (User speaks) → Correct introduction

Questions:

  1. Could sequential response.create calls be conflicting?

  2. Does conversation: 'none' affect subsequent responses?

  3. Is turn detection (silence_duration_ms: 500) interfering?

  4. What’s the recommended way to reliably control staged speech?

Tried:

  • Timing adjustments (800ms, 1800ms, 2500ms, etc.)

  • More detailed instructions

  • Different conversation parameters
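One variation not in the list above is to drop the fixed timers and send the second response.create only once the server reports that the first response has finished. The sketch below assumes the server emits a `response.done` event when a response completes, and that `ws` is any object with a `send(string)` method (openAIWs in the snippet above). `createStagedSpeaker`, `start`, and `handleEvent` are hypothetical names made up for illustration, and none of this has been tested against a live call.

```javascript
// Sketch only: sequence staged response.create messages on response.done
// instead of on fixed setTimeout delays.
// Assumption: the server sends { type: 'response.done', ... } when a response ends.
function createStagedSpeaker(ws, stages) {
  let next = 0; // index of the next stage to send

  // Send the next stage if one is left; return true if something was sent.
  const sendNext = () => {
    if (next >= stages.length) return false;
    ws.send(JSON.stringify({ type: 'response.create', response: stages[next] }));
    next += 1;
    return true;
  };

  return {
    // Send the first stage (for example once the session is configured).
    start: sendNext,
    // Call with every parsed server event; advances only on response.done.
    handleEvent(event) {
      if (event.type === 'response.done') return sendNext();
      return false;
    },
  };
}

// Possible wiring (not executed here):
// const speaker = createStagedSpeaker(openAIWs, [
//   { modalities: ['text', 'audio'], instructions: 'Say only "Hello."' },
//   { modalities: ['text', 'audio'],
//     instructions: 'Say "Hello, I am calling on behalf of..." then explain the request.' },
// ]);
// speaker.start();
// openAIWs.on('message', (data) => speaker.handleEvent(JSON.parse(data)));
```

Note that with server_vad enabled, a response triggered by the caller speaking would also end with `response.done`, so this simple version could advance a stage early. If the 1-second gap matters, a delay could be added inside `handleEvent` before the next send.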

Any advice would be greatly appreciated!

Did you try putting all the instructions in the initial session.update and then omitting instructions from the response.create?
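Roughly what that could look like, as an untested sketch: the field names are copied from the question's snippet, while `buildSessionUpdate` and `buildPlainResponseCreate` are hypothetical helper names used only for illustration. The idea is that the session carries the whole script and the response.create carries no per-response instructions.

```javascript
// Sketch only: keep all instructions at the session level.
function buildSessionUpdate(instructions) {
  return {
    type: 'session.update',
    session: {
      modalities: ['text', 'audio'],
      instructions, // the full script lives here
      voice: 'alloy',
      input_audio_format: 'g711_ulaw',
      output_audio_format: 'g711_ulaw',
    },
  };
}

// A response.create with no instructions field, so the session-level
// instructions are the only ones in play.
function buildPlainResponseCreate() {
  return {
    type: 'response.create',
    response: { modalities: ['text', 'audio'] },
  };
}

// Possible wiring (not executed here):
// openAIWs.send(JSON.stringify(buildSessionUpdate(
//   'Phone agent. Start with "Hello" then immediately introduce and explain request.'
// )));
// openAIWs.send(JSON.stringify(buildPlainResponseCreate()));
```

Whether this removes the irrelevant speech in between would still need to be checked on a real call.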
