Environment:
- OpenAI Realtime API (gpt-realtime)
- Twilio Media Streams integration
- WebSocket connection (g711_ulaw audio format)
Issue:
On a Twilio phone call, I want my AI agent to say "Hello" → "Introduction + request details", but it actually says "Hello" → irrelevant content → the correct content only after the user speaks.
// Session configuration
const sessionConfig = {
  type: 'session.update',
  session: {
    modalities: ['text', 'audio'],
    instructions: `Phone agent. Start with "Hello" then immediately introduce and explain request.`,
    voice: 'alloy',
    input_audio_format: 'g711_ulaw',
    output_audio_format: 'g711_ulaw',
    turn_detection: {
      type: 'server_vad',
      threshold: 0.5,
      prefix_padding_ms: 300,
      silence_duration_ms: 500,
      interrupt_response: true
    },
    temperature: 0.7,
    max_response_output_tokens: 4096
  }
};
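For context, this config is applied as soon as the OpenAI WebSocket opens. Simplified sketch of my connection setup (URL and headers approximate, using the standard `ws` client):

import WebSocket from 'ws';

// Simplified: plain `ws` client pointed at the Realtime API.
const openAIWs = new WebSocket(
  'wss://api.openai.com/v1/realtime?model=gpt-realtime',
  { headers: { Authorization: `Bearer ${process.env.OPENAI_API_KEY}` } }
);

openAIWs.on('open', () => {
  // Apply the session configuration before any audio flows.
  openAIWs.send(JSON.stringify(sessionConfig));
});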
// Staged speech control
setTimeout(() => {
  const greetOnly = {
    type: 'response.create',
    response: {
      conversation: 'none', // out-of-band: response isn't added to the default conversation
      modalities: ['text', 'audio'],
      instructions: 'Say only "Hello." within 0.5 seconds. Stay silent otherwise.'
    }
  };
  openAIWs.send(JSON.stringify(greetOnly));
}, 800);
setTimeout(() => {
  const startRequest = {
    type: 'response.create',
    response: {
      modalities: ['text', 'audio'],
      instructions: 'Say "Hello, I am calling on behalf of..." then explain the request.'
    }
  };
  openAIWs.send(JSON.stringify(startRequest));
}, 1800);
Expected: Hello → (~1 s gap) → Hello, I am calling on behalf of…
Actual: Hello → irrelevant questions/speech → (user speaks) → correct introduction
Questions:
- Could sequential response.create calls be conflicting?
- Does conversation: 'none' affect subsequent responses?
- Is turn detection (silence_duration_ms: 500) interfering?
- What's the recommended way to reliably control staged speech? (Would event-driven chaining, as sketched below, be the right pattern?)
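One alternative I've been considering but haven't verified: drop the timers entirely and trigger the second response.create only after the greeting's response.done event arrives. Untested sketch (the greetingDone flag is just my own naming):

// Chain the intro off the greeting's completion instead of timers.
let greetingDone = false;

openAIWs.on('message', (raw) => {
  const event = JSON.parse(raw.toString());

  // Fire the intro exactly once, after the greeting response has fully finished.
  if (event.type === 'response.done' && !greetingDone) {
    greetingDone = true;
    openAIWs.send(JSON.stringify({
      type: 'response.create',
      response: {
        modalities: ['text', 'audio'],
        instructions: 'Say "Hello, I am calling on behalf of..." then explain the request.'
      }
    }));
  }
});

Is this the intended way to sequence responses, or is there a dedicated mechanism I'm missing?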
Tried:
- Timing adjustments (800 ms, 1800 ms, 2500 ms, etc.)
- More detailed instructions
- Different conversation parameters (one variant sketched below)
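For example, one of the conversation-parameter variants looked roughly like this: the greeting sent as a normal in-conversation response, with conversation: 'none' removed so the model would have the greeting in context for the follow-up.

// Variant: greeting as part of the default conversation (no conversation: 'none').
const greetOnlyVariant = {
  type: 'response.create',
  response: {
    modalities: ['text', 'audio'],
    instructions: 'Say only "Hello." Stay silent otherwise.'
  }
};
openAIWs.send(JSON.stringify(greetOnlyVariant));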
Any advice would be greatly appreciated!