Voice setting not applied consistently despite correct session.update sequence

### Issue Summary

I’m experiencing an issue with the OpenAI Realtime API where the `voice` setting (`marin`) is not consistently applied to audio responses, even though:

1. `session.update` is sent with `voice: 'marin'` in `audio.output`

2. `session.updated` event is received and confirms `voice: 'marin'`

3. `response.create` is sent **only after** receiving `session.updated`

4. `response.created` event confirms `voice: 'marin'`

Despite all these confirmations, the actual audio output sometimes uses a male voice (likely `echo` or the default voice) instead of the requested `marin` voice.

### Environment

- **API**: OpenAI Realtime API (`gpt-realtime-mini`)

- **Protocol**: WebSocket

- **Connection**: Server-to-server (Node.js application with Twilio Media Streams)

- **Voice requested**: `marin` (female voice)

- **Observed behavior**: Occasionally a male voice is used instead

### Expected Behavior

When `voice: 'marin'` is set in `session.update` and confirmed in `session.updated`, all subsequent audio responses should consistently use the `marin` voice.

### Actual Behavior

Audio responses sometimes use a male voice (not `marin`) despite:

- Correct `voice: 'marin'` setting in `session.update`

- Confirmation of `voice: 'marin'` in `session.updated` event

- Confirmation of `voice: 'marin'` in `response.created` event

- `response.create` being sent **after** `session.updated` is received

### Code Implementation

**Session initialization (openaiHandler.js):**

```javascript
// openaiHandler.js

// 1. Wait for session.created after the socket opens
ws.on('open', () => {
  // Wait for session.created event...
});

ws.on('message', (data) => {
  const response = JSON.parse(data);

  // 2. Send session.update after session.created
  if (response.type === 'session.created') {
    const sessionUpdatePayload = {
      type: 'session.update',
      session: {
        model: 'gpt-realtime-mini',
        output_modalities: ['audio'],
        audio: {
          output: {
            format: { type: 'audio/pcm', rate: 24000 },
            voice: 'marin' // Explicitly set voice
          }
        }
        // ... other settings
      }
    };
    ws.send(JSON.stringify(sessionUpdatePayload));
  }

  // 3. Wait for session.updated before sending response.create
  if (response.type === 'session.updated') {
    // Confirm voice setting
    const voice = response.session.audio.output.voice;
    console.log('Voice confirmed:', voice); // Logs: "marin"

    // Store in pendingSessionUpdated for twilioHandler
    ws.pendingSessionUpdated = response;
    resolve(ws); // resolves the enclosing connection promise
  }
});
```

**Response creation (twilioHandler.js):**

```javascript
// twilioHandler.js

// Only send response.create AFTER session.updated is received
openaiWs.onSessionUpdated = (response) => {
  const responseCreatePayload = {
    type: 'response.create',
    response: {
      output_modalities: ['audio'],
      instructions: 'Please greet in Japanese...',
      audio: {
        output: {
          voice: 'marin' // Also explicitly set in response.create
        }
      }
    }
  };
  openaiWs.send(JSON.stringify(responseCreatePayload));
};

// If session.updated already arrived, handle it immediately
if (openaiWs.pendingSessionUpdated) {
  openaiWs.onSessionUpdated(openaiWs.pendingSessionUpdated);
}
```

### Event Timeline (from logs)

```
16:04:30.993 - [OpenAI] session.created received
16:04:30.994 - [OpenAI] Sending session.update with voice: 'marin'
16:04:31.279 - [OpenAI] session.updated received
16:04:31.279 - [OpenAI] Voice confirmed: marin
16:04:31.280 - [TwilioHandler] Sending response.create (1ms after session.updated)
16:04:31.458 - [OpenAI] response.created received
16:04:31.458 - [OpenAI] ✓ response.created voice confirmed: marin
```

**All events confirm `marin` voice**, but the actual audio is sometimes male.

### Steps to Reproduce

1. Connect to OpenAI Realtime API via WebSocket

2. Wait for `session.created` event

3. Send `session.update` with `audio.output.voice: 'marin'`

4. Wait for `session.updated` event

5. Confirm `voice: 'marin'` in `session.updated`

6. Send `response.create` with explicit `audio.output.voice: 'marin'`

7. Observe `response.created` confirms `voice: 'marin'`

8. Listen to the actual audio output

**Result:** The audio sometimes uses a male voice instead of `marin`
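The two payloads in steps 3 and 6 can be sketched as small builders (illustrative helper names; the payload shapes follow the handlers shown earlier, and no fields beyond those are assumed), which makes it easy to assert that `voice` survives serialization before anything is sent:

```javascript
// Minimal payload builders mirroring steps 3 and 6.
// buildSessionUpdate / buildResponseCreate are illustrative names.

function buildSessionUpdate(voice) {
  return {
    type: 'session.update',
    session: {
      model: 'gpt-realtime-mini',
      output_modalities: ['audio'],
      audio: {
        output: {
          format: { type: 'audio/pcm', rate: 24000 },
          voice
        }
      }
    }
  };
}

function buildResponseCreate(voice, instructions) {
  return {
    type: 'response.create',
    response: {
      output_modalities: ['audio'],
      instructions,
      audio: { output: { voice } }
    }
  };
}

const update = buildSessionUpdate('marin');
const create = buildResponseCreate('marin', 'Please greet in Japanese...');
console.log(update.session.audio.output.voice);  // prints "marin"
console.log(create.response.audio.output.voice); // prints "marin"
```

Checking the serialized JSON this way rules out the client mutating or dropping the voice field; in the reported case both payloads are correct on the wire, yet the audio still differs.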

### What I’ve Tried

1. :white_check_mark: Explicitly setting `voice: 'marin'` in `session.update`

2. :white_check_mark: Explicitly setting `voice: 'marin'` in `response.create`

3. :white_check_mark: Waiting for `session.updated` before sending `response.create`

4. :white_check_mark: Confirming voice in all events (`session.updated`, `response.created`)

5. :white_check_mark: Ensuring correct event sequence: `session.created` → `session.update` → `session.updated` → `response.create`

6. :white_check_mark: Adding delays between events (tested with a 200ms delay)

7. :white_check_mark: Checking for race conditions and timing issues

8. :white_check_mark: Verifying the audio conversion pipeline (µ-law ↔ PCM16) is not affecting the voice

**None of these approaches resolved the issue consistently.**
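To make future occurrences easier to report, every incoming event's type and any `voice` field it carries can be recorded and later correlated with the audible output. A minimal sketch (the recorder and `findVoice` are illustrative helpers, not part of the API):

```javascript
// Diagnostic sketch: record each event's type plus any "voice" value
// found anywhere in its payload, with a timestamp.

function findVoice(obj) {
  // Recursively search the event object for a string "voice" property.
  if (obj === null || typeof obj !== 'object') return undefined;
  if (typeof obj.voice === 'string') return obj.voice;
  for (const value of Object.values(obj)) {
    const found = findVoice(value);
    if (found !== undefined) return found;
  }
  return undefined;
}

function createEventRecorder() {
  const records = [];
  return {
    records,
    onMessage(data) {
      const event = JSON.parse(data);
      records.push({
        at: new Date().toISOString(),
        type: event.type,
        voice: findVoice(event)
      });
    }
  };
}

// Usage sketch with a session.updated event like the one in the timeline:
const recorder = createEventRecorder();
recorder.onMessage(JSON.stringify({
  type: 'session.updated',
  session: { audio: { output: { voice: 'marin' } } }
}));
console.log(recorder.records[0].type, recorder.records[0].voice);
// prints "session.updated marin"
```

Dumping `records` alongside a note of which responses actually sounded wrong would give support a per-response trail rather than only the settings that were requested.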

### Additional Observations

1. The issue is **intermittent**: sometimes it works correctly, sometimes it doesn't

2. All API events **confirm** `voice: 'marin'` in logs

3. Pre-conversion audio analysis shows the male voice is coming **directly from OpenAI** (not a local conversion issue)

4. The problem occurs at the **first greeting** of a new session

5. Similar issues have been reported in the community (voice settings not being respected)

### Questions

1. Is there a known delay or caching issue where `voice` settings in `session.update` might not be immediately applied?

2. Should there be an additional waiting period after `session.updated` before sending `response.create`?

3. Is there a more reliable way to ensure voice settings are applied?

4. Are there any additional diagnostics or events I should be monitoring?

### Request

This is a critical issue for production voice applications. Could you please:

1. Confirm if this is a known bug

2. Provide a reliable workaround

3. Estimate when a fix might be available

### Environment Details

- **Node.js version**: 18.x

- **WebSocket library**: `ws` (npm package)

- **Audio format**: PCM16 24kHz (as specified by Realtime API)

- **Connection type**: Server-to-server (Twilio → Node.js → OpenAI)

- **Region**: Using default OpenAI endpoint

Thank you for your help!