Subject: iOS Swift Realtime API WebSocket Disconnects (Code 1000) Immediately After First Audio Packet

Hi everyone,

I'm encountering a persistent issue using the Realtime API (`gpt-4o-realtime-preview`) via WebSockets from a native iOS Swift application using `URLSessionWebSocketTask`.

**The Problem:**

The WebSocket connection establishes successfully, and I receive the `session.created` event from the server (correctly indicating `input_audio_format: "pcm16"`). However, the connection is **immediately closed by the server (code 1000)** right after my client successfully sends the *very first* `input_audio_buffer.append` message containing the initial audio chunk. Subsequent send attempts or the receive loop then fail with "Socket is not connected".

**What Works:**

*   Establishing the initial WebSocket connection with `Authorization: Bearer <KEY>` and `OpenAI-Beta: realtime=v1` headers.
*   Receiving the `session.created` event.
*   Capturing audio using `AVAudioEngine` and converting it to 16kHz, 16-bit Little Endian PCM `Data`.
*   Sending the *first* `input_audio_buffer.append` event (logs confirm the JSON payload is sent).
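
For context, the send path is essentially the following (a minimal sketch; the event shape follows the Realtime API's `input_audio_buffer.append` schema, and the helper names are just illustrative):

```swift
import Foundation

// Builds the `input_audio_buffer.append` client event from raw PCM16 bytes.
// The event shape follows the Realtime API schema; helper names are mine.
func buildAppendEvent(from pcm16Data: Data) -> String? {
    let event: [String: Any] = [
        "type": "input_audio_buffer.append",
        "audio": pcm16Data.base64EncodedString()  // raw 16 kHz, 16-bit LE PCM
    ]
    guard let json = try? JSONSerialization.data(withJSONObject: event) else { return nil }
    return String(data: json, encoding: .utf8)
}

// Sending is then a one-liner against the already-connected task.
func send(_ audio: Data, over task: URLSessionWebSocketTask) {
    guard let text = buildAppendEvent(from: audio) else { return }
    task.send(.string(text)) { error in
        if let error = error { print("Send failed: \(error)") }
    }
}
```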

**Log Snippet Showing the Failure Point:**

```text
// ... Connection + session.created logs ...
Received server event type: session.created
Session created with ID: sess_BHFX0Gp6FLH1q4rEJbvq5. Starting audio IMMEDIATELY...
Starting audio engine and installing tap...
Audio engine started successfully.
Audio Tap: Received buffer with frameLength = 4800
Audio Tap: Converted buffer frameLength = 1600
pcmBufferToData: Called with buffer frameLength = 1600
pcmBufferToData: Returning data with count = 3200 // Correct: 1600 frames * 2 bytes/sample
Audio Tap: Queueing raw audio data (3200 bytes)
Timer: Sending dequeued audio chunk (3200 bytes)
sendAudioData: Raw data size = 3200
sendAudioData: Base64 data size = 4268 // Correct Base64 size
Sending client event: input_audio_buffer.append (5489 bytes) // First chunk sent successfully!
// --- IMMEDIATELY AFTER THIS ---
WebSocket Delegate: Did close with code 1000, reason:
nw_flow_add_write_request [...] cannot accept write requests
nw_write_request_report [...] Send failed with error "Socket is not connected"
WebSocket receive error after disconnect (expected): The operation couldn't be completed. Socket is not connected
// ... Subsequent sends/receives fail ...
```

**Troubleshooting Steps Attempted (without success):**

1.  **Simplest flow:** Connect → `session.created` → immediately start the `AVAudioEngine` tap → send `input_audio_buffer.append` directly from the tap. (Disconnects after the first send.)
2.  **Delay after `session.created`:** Added a 100 ms delay after receiving `session.created` before starting the audio engine/tap. (Still disconnects after the first send.)
3.  **Explicit format update:** Sent a `session.update` confirming `input_audio_format: "pcm16"` after `session.created` and before starting audio. (Disconnects right after the `session.update`, or after the first audio packet.)
4.  **Buffering/rate limiting:** Implemented a queue and a `Timer` to send audio chunks every 100 ms, decoupling sending from the audio tap callback. (Still disconnects after the timer sends the first chunk.)
5.  **Audio data verification:** Added detailed logging confirming the audio format conversion (48 kHz Float32 → 16 kHz Int16 LE) and the data sizes (raw bytes, Base64 bytes, final JSON payload size) all appear correct.
6.  **Headers:** Ensured the correct `Authorization` and `OpenAI-Beta` headers are used. Tried adding/removing `Sec-WebSocket-Protocol` (removing it fixed an initial beta-header error).
7.  **API key/usage:** Confirmed the API key is valid, active, and well below usage limits.
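
For step 5, this is roughly how the conversion is set up (trimmed for brevity; the real code also checks the converter's status and handles errors):

```swift
import AVFoundation

// Builds the 48 kHz Float32 → 16 kHz Int16 LE converter referenced in step 5.
// Helper names are mine; `input` is the tap's native format from the engine.
func makeConverter(from input: AVAudioFormat) -> AVAudioConverter? {
    guard let target = AVAudioFormat(commonFormat: .pcmFormatInt16,
                                     sampleRate: 16_000,
                                     channels: 1,
                                     interleaved: true) else { return nil }
    return AVAudioConverter(from: input, to: target)
}

// Flattens a converted Int16 buffer into raw bytes:
// 2 bytes per sample, mono, so frameLength * 2 bytes total.
func pcmBufferToData(_ buffer: AVAudioPCMBuffer) -> Data {
    let byteCount = Int(buffer.frameLength) * MemoryLayout<Int16>.size
    guard let channel = buffer.int16ChannelData?[0] else { return Data() }
    return Data(bytes: channel, count: byteCount)
}
```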

**Observations & Related Issues:**

*   The connection remains stable indefinitely if no client messages are sent after `session.created`. The closure is strictly triggered by the first `input_audio_buffer.append` (or `session.update`).
*   This behavior mirrors issues reported in this thread: [Constantly disconnecting after session update with Realtime API], particularly louzell’s observation that direct client connections fail while proxied ones work, and aidanallchin’s suggestion about timing/readiness after connection.
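
One thing worth noting for anyone reproducing this: `URLSessionWebSocketTask.receive` delivers exactly one message per call, so the receive loop must re-arm itself after every message — otherwise a server `error` event explaining the close could be missed entirely. My loop looks roughly like this (simplified; the real handler decodes the JSON):

```swift
import Foundation

// `receive` is one-shot: each call yields a single message, so the loop
// must call itself again after every success to keep reading server events.
func listen(on task: URLSessionWebSocketTask) {
    task.receive { result in
        switch result {
        case .failure(let error):
            print("Receive error: \(error)")            // socket is likely gone
        case .success(let message):
            if case .string(let text) = message {
                print("Received server event: \(text)") // check for `error` events here
            }
            listen(on: task)                            // re-arm for the next message
        }
    }
}
```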

**Questions:**

1.  Is this a known issue or limitation when using `URLSessionWebSocketTask` directly from iOS with the Realtime API?
2.  Is there a specific, undocumented state the server needs to reach after `session.created` before it can accept `input_audio_buffer.append` events without closing the connection?
3.  Could there be a very subtle byte-level encoding issue with the initial PCM data or the Base64 encoding from Swift’s `Data` methods that the server rejects?
4.  Are there any other non-obvious configuration steps or messages required when streaming audio over WebSockets?
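
Regarding question 3: a standalone size check (pure Foundation, no audio involved) reproduces the logged byte counts exactly, which suggests the Base64 step itself is sound:

```swift
import Foundation

// Sanity check for question 3: 1600 Int16 samples should produce 3200 raw
// bytes and a 4268-character Base64 string, matching the log output.
// (iOS/arm is little-endian, so the in-memory bytes are already PCM16 LE.)
let samples = [Int16](repeating: 0, count: 1_600)
let raw = samples.withUnsafeBufferPointer { Data(buffer: $0) }
let base64 = raw.base64EncodedString()
print(raw.count)     // 3200 = 1600 frames * 2 bytes/sample
print(base64.count)  // 4268 = ceil(3200 / 3) * 4, with padding
```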

Any insights or suggestions would be greatly appreciated! The fallback to a simulation mode works, but I’d really like to get the live connection stable.

Thanks!