Environment:
- OpenAI Realtime API (gpt-realtime)
- Twilio Media Streams integration
- WebSocket connection (g711_ulaw audio format)
Issue:
On a Twilio phone call, I want my AI agent to say "Hello" → "Introduction + request details", but it actually says "Hello" → irrelevant content → the correct content only after the user speaks.
// Session configuration
const sessionConfig = {
  type: 'session.update',
  session: {
    modalities: ['text', 'audio'],
    instructions: `Phone agent. Start with "Hello" then immediately introduce and explain request.`,
    voice: 'alloy',
    input_audio_format: 'g711_ulaw',
    output_audio_format: 'g711_ulaw',
    turn_detection: {
      type: 'server_vad',
      threshold: 0.5,
      prefix_padding_ms: 300,
      silence_duration_ms: 500,
      interrupt_response: true
    },
    temperature: 0.7,
    max_response_output_tokens: 4096
  }
};
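For context, this config is applied as soon as the OpenAI WebSocket opens. Simplified sketch of my connection setup (URL and headers approximate, using the standard `ws` client):

import WebSocket from 'ws';

// Simplified: plain `ws` client pointed at the Realtime API.
const openAIWs = new WebSocket(
  'wss://api.openai.com/v1/realtime?model=gpt-realtime',
  { headers: { Authorization: `Bearer ${process.env.OPENAI_API_KEY}` } }
);

openAIWs.on('open', () => {
  // Apply the session configuration before any audio flows.
  openAIWs.send(JSON.stringify(sessionConfig));
});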
// Staged speech control
setTimeout(() => {
  const greetOnly = {
    type: 'response.create',
    response: {
      conversation: 'none', // out-of-band: response isn't added to the default conversation
      modalities: ['text', 'audio'],
      instructions: 'Say only "Hello." within 0.5 seconds. Stay silent otherwise.'
    }
  };
  openAIWs.send(JSON.stringify(greetOnly));
}, 800);
setTimeout(() => {
  const startRequest = {
    type: 'response.create',
    response: {
      modalities: ['text', 'audio'],
      instructions: 'Say "Hello, I am calling on behalf of..." then explain the request.'
    }
  };
  openAIWs.send(JSON.stringify(startRequest));
}, 1800);
Expected: Hello → (~1 s gap) → Hello, I am calling on behalf of…
Actual: Hello → irrelevant questions/speech → (user speaks) → correct introduction
Questions:
- Could sequential response.create calls be conflicting?
- Does conversation: 'none' affect subsequent responses?
- Is turn detection (silence_duration_ms: 500) interfering?
- What's the recommended way to reliably control staged speech? (Would event-driven chaining, as sketched below, be the right pattern?)
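One alternative I've been considering but haven't verified: drop the timers entirely and trigger the second response.create only after the greeting's response.done event arrives. Untested sketch (the greetingDone flag is just my own naming):

// Chain the intro off the greeting's completion instead of timers.
let greetingDone = false;

openAIWs.on('message', (raw) => {
  const event = JSON.parse(raw.toString());

  // Fire the intro exactly once, after the greeting response has fully finished.
  if (event.type === 'response.done' && !greetingDone) {
    greetingDone = true;
    openAIWs.send(JSON.stringify({
      type: 'response.create',
      response: {
        modalities: ['text', 'audio'],
        instructions: 'Say "Hello, I am calling on behalf of..." then explain the request.'
      }
    }));
  }
});

Is this the intended way to sequence responses, or is there a dedicated mechanism I'm missing?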
Tried:
- Timing adjustments (800 ms, 1800 ms, 2500 ms, etc.)
- More detailed instructions
- Different conversation parameters (one variant sketched below)
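For example, one of the conversation-parameter variants looked roughly like this: the greeting sent as a normal in-conversation response, with conversation: 'none' removed so the model would have the greeting in context for the follow-up.

// Variant: greeting as part of the default conversation (no conversation: 'none').
const greetOnlyVariant = {
  type: 'response.create',
  response: {
    modalities: ['text', 'audio'],
    instructions: 'Say only "Hello." Stay silent otherwise.'
  }
};
openAIWs.send(JSON.stringify(greetOnlyVariant));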
Any advice would be greatly appreciated!