### Issue Summary
I’m experiencing an issue with the OpenAI Realtime API where the `voice` setting (`marin`) is not consistently applied to audio responses, even though:
1. `session.update` is sent with `voice: 'marin'` in `audio.output`
2. `session.updated` event is received and confirms `voice: 'marin'`
3. `response.create` is sent **only after** receiving `session.updated`
4. `response.created` event confirms `voice: 'marin'`
Despite all these confirmations, the actual audio output sometimes uses a male voice (likely `echo` or the default voice) instead of the requested `marin` voice.
### Environment
- **API**: OpenAI Realtime API (`gpt-realtime-mini`)
- **Protocol**: WebSocket
- **Connection**: Server-to-server (Node.js application with Twilio Media Streams)
- **Voice requested**: `marin` (female voice)
- **Observed behavior**: Occasionally male voice is used instead
### Expected Behavior
When `voice: 'marin'` is set in `session.update` and confirmed in `session.updated`, all subsequent audio responses should consistently use the `marin` voice.
### Actual Behavior
Audio responses sometimes use a male voice (not `marin`) despite:
- Correct `voice: 'marin'` setting in `session.update`
- Confirmation of `voice: 'marin'` in the `session.updated` event
- Confirmation of `voice: 'marin'` in the `response.created` event
- `response.create` being sent **after** `session.updated` is received
### Code Implementation
**Session initialization (openaiHandler.js):**
```javascript
// 1. Wait for session.created
ws.on('open', () => {
  // Wait for the session.created event before configuring the session
});

ws.on('message', (data) => {
  const response = JSON.parse(data);

  // 2. Send session.update after session.created
  if (response.type === 'session.created') {
    const sessionUpdatePayload = {
      type: 'session.update',
      session: {
        model: 'gpt-realtime-mini',
        output_modalities: ['audio'],
        audio: {
          output: {
            format: { type: 'audio/pcm', rate: 24000 },
            voice: 'marin' // Explicitly set voice
          }
        }
        // ... other settings
      }
    };
    ws.send(JSON.stringify(sessionUpdatePayload));
  }

  // 3. Wait for session.updated before sending response.create
  if (response.type === 'session.updated') {
    // Confirm the voice setting
    const voice = response.session.audio.output.voice;
    console.log('Voice confirmed:', voice); // Logs: "marin"

    // Store in pendingSessionUpdated for twilioHandler
    ws.pendingSessionUpdated = response;
    resolve(ws);
  }
});
```
**Response creation (twilioHandler.js):**
```javascript
// Only send response.create AFTER session.updated is received
openaiWs.onSessionUpdated = (response) => {
  const responseCreatePayload = {
    type: 'response.create',
    response: {
      output_modalities: ['audio'],
      instructions: 'Please greet in Japanese...',
      audio: {
        output: {
          voice: 'marin' // Also explicitly set in response.create
        }
      }
    }
  };
  openaiWs.send(JSON.stringify(responseCreatePayload));
};

// Check pendingSessionUpdated immediately (in case session.updated already arrived)
if (openaiWs.pendingSessionUpdated) {
  openaiWs.onSessionUpdated(openaiWs.pendingSessionUpdated);
}
```
### Event Timeline (from logs)
```
16:04:30.993 - [OpenAI] session.created received
16:04:30.994 - [OpenAI] Sending session.update with voice: 'marin'
16:04:31.279 - [OpenAI] session.updated received
16:04:31.279 - [OpenAI] Voice confirmed: marin
16:04:31.280 - [TwilioHandler] Sending response.create (1ms after session.updated)
16:04:31.458 - [OpenAI] response.created received
16:04:31.458 - [OpenAI] ✓ response.created voice confirmed: marin
```
**All events confirm `marin` voice**, but the actual audio is sometimes male.
### Steps to Reproduce
1. Connect to OpenAI Realtime API via WebSocket
2. Wait for `session.created` event
3. Send `session.update` with `audio.output.voice: 'marin'`
4. Wait for the `session.updated` event
5. Confirm `voice: 'marin'` in `session.updated`
6. Send `response.create` with an explicit `audio.output.voice: 'marin'`
7. Observe that `response.created` confirms `voice: 'marin'`
8. Listen to the actual audio output
**Result:** The audio sometimes uses a male voice instead of `marin`
### What I’ve Tried
1. Explicitly setting `voice: 'marin'` in `session.update`
2. Explicitly setting `voice: 'marin'` in `response.create`
3. Waiting for `session.updated` before sending `response.create`
4. Confirming the voice in all events (`session.updated`, `response.created`)
5. Ensuring the correct event sequence: `session.created` → `session.update` → `session.updated` → `response.create`
6. Adding delays between events (tested with a 200 ms delay)
7. Checking for race conditions and timing issues
8. Verifying that the audio conversion pipeline (µ-law ↔ PCM16) is not affecting the voice
**None of these approaches resolved the issue consistently.**
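One further diagnostic that may help narrow this down: scan every incoming server event for any `voice` field, wherever it is nested in the payload, and flag mismatches per event rather than relying on the three confirmation events alone. A small sketch; `collectVoices` and `logVoicePerEvent` are hypothetical helpers, not part of any SDK, and the event shapes are illustrative:

```javascript
// Recursively collect every string value stored under a "voice" key,
// so nested payload shapes don't need to be known in advance.
function collectVoices(obj, found = []) {
  if (obj && typeof obj === 'object') {
    for (const [key, value] of Object.entries(obj)) {
      if (key === 'voice' && typeof value === 'string') found.push(value);
      collectVoices(value, found);
    }
  }
  return found;
}

// Attach to the raw message stream and flag any event whose voice
// differs from the one the session was configured with.
function logVoicePerEvent(ws, expectedVoice, log = console.log) {
  ws.on('message', (data) => {
    const event = JSON.parse(data);
    const voices = collectVoices(event);
    const mismatch = voices.some((v) => v !== expectedVoice);
    log(`${event.type} voices=[${voices.join(', ')}]${mismatch ? ' <-- MISMATCH' : ''}`);
  });
}
```

If every logged event reports `marin` while the audio is still male, that would point at the audio generation itself rather than at any field in the event stream.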
### Additional Observations
1. The issue is **intermittent**: it sometimes works correctly and sometimes does not
2. All API events **confirm** `voice: 'marin'` in the logs
3. Pre-conversion audio analysis shows the male voice is coming **directly from OpenAI** (not a local conversion issue)
4. The problem occurs at the **first greeting** of a new session
5. Similar issues have been reported in the community (voice settings not being respected)
### Questions
1. Is there a known delay or caching issue where `voice` settings in `session.update` might not be immediately applied?
2. Should there be an additional waiting period after `session.updated` before sending `response.create`?
3. Is there a more reliable way to ensure voice settings are applied?
4. Are there any additional diagnostics or events I should be monitoring?
### Request
This is a critical issue for production voice applications. Could you please:
1. Confirm if this is a known bug
2. Provide a reliable workaround
3. Estimate when a fix might be available
### Environment Details
- **Node.js version**: 18.x
- **WebSocket library**: `ws` (npm package)
- **Audio format**: PCM16 24kHz (as specified by Realtime API)
- **Connection type**: Server-to-server (Twilio → Node.js → OpenAI)
- **Region**: Using default OpenAI endpoint
Thank you for your help!