Hi everyone,
I'm encountering a persistent issue using the Realtime API (`gpt-4o-realtime-preview`) via WebSockets from a native iOS Swift application using `URLSessionWebSocketTask`.
**The Problem:**
The WebSocket connection establishes successfully, and I receive the `session.created` event from the server (correctly indicating `input_audio_format: "pcm16"`). However, the connection is **immediately closed by the server (code 1000)** right after my client successfully sends the *very first* `input_audio_buffer.append` message containing the initial audio chunk. Subsequent send attempts or the receive loop then fail with "Socket is not connected".
**What Works:**
* Establishing the initial WebSocket connection with `Authorization: Bearer <KEY>` and `OpenAI-Beta: realtime=v1` headers (see the connection sketch after this list).
* Receiving the `session.created` event.
* Capturing audio using `AVAudioEngine` and converting it to 16kHz, 16-bit Little Endian PCM `Data`.
* Sending the *first* `input_audio_buffer.append` event (logs confirm the JSON payload is sent).
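
For reference, here is a trimmed sketch of the connection setup (simplified: the real code reads the key from configuration, and error handling is omitted):

```swift
import Foundation

let apiKey = ProcessInfo.processInfo.environment["OPENAI_API_KEY"] ?? ""

// Build the upgrade request with the two headers that get past the handshake.
var request = URLRequest(url: URL(string:
    "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview")!)
request.setValue("Bearer \(apiKey)", forHTTPHeaderField: "Authorization")
request.setValue("realtime=v1", forHTTPHeaderField: "OpenAI-Beta")
// Note: no Sec-WebSocket-Protocol header; adding one caused a beta header error.

let webSocketTask = URLSession.shared.webSocketTask(with: request)
webSocketTask.resume()
```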
**Log Snippet Showing the Failure Point:**
```
// ... Connection + session.created logs ...
Received server event type: session.created
Session created with ID: sess_BHFX0Gp6FLH1q4rEJbvq5. Starting audio IMMEDIATELY...
Starting audio engine and installing tap...
Audio engine started successfully.
Audio Tap: Received buffer with frameLength = 4800
Audio Tap: Converted buffer frameLength = 1600
pcmBufferToData: Called with buffer frameLength = 1600
pcmBufferToData: Returning data with count = 3200 // Correct: 1600 frames * 2 bytes/sample
Audio Tap: Queueing raw audio data (3200 bytes)
Timer: Sending dequeued audio chunk (3200 bytes)
sendAudioData: Raw data size = 3200
sendAudioData: Base64 data size = 4268 // Correct Base64 size
Sending client event: input_audio_buffer.append (5489 bytes) // First chunk sent successfully!
// --- IMMEDIATELY AFTER THIS ---
WebSocket Delegate: Did close with code 1000, reason:
nw_flow_add_write_request [...] cannot accept write requests
nw_write_request_report [...] Send failed with error "Socket is not connected"
WebSocket receive error after disconnect (expected): The operation couldn't be completed. Socket is not connected
// ... Subsequent sends/receives fail ...
```
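
For completeness, the send path that produces the `input_audio_buffer.append` line above is essentially the following (simplified: `webSocketTask` is the task from the connection sketch, and `audioData` is one 3200-byte PCM chunk):

```swift
// Each append event is just a type plus the Base64-encoded audio chunk.
let event: [String: Any] = [
    "type": "input_audio_buffer.append",
    "audio": audioData.base64EncodedString() // 3200 raw bytes -> 4268 Base64 chars
]
let payload = try! JSONSerialization.data(withJSONObject: event)
webSocketTask.send(.string(String(data: payload, encoding: .utf8)!)) { error in
    if let error = error {
        print("Send failed: \(error)")
    }
}
```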
**Troubleshooting Steps Attempted (without success):**
- Simplest Flow: Connect → `session.created` → immediately start the `AVAudioEngine` tap → send `input_audio_buffer.append` directly from the tap. (Disconnects after the first send.)
- Delay After `session.created`: Added a 100 ms delay after receiving `session.created` before starting the audio engine/tap. (Still disconnects after the first send.)
- Explicit Format Update: Sent a `session.update` confirming `input_audio_format: "pcm16"` after `session.created` and before starting audio. (Disconnects right after the `session.update` or after the first audio packet.)
- Buffering/Rate Limiting: Implemented a queue and a `Timer` to send audio chunks every 100 ms, decoupling sending from the audio tap callback. (Still disconnects after the first chunk is sent by the timer.)
- Audio Data Verification: Added detailed logging confirming that the audio format conversion (48 kHz Float32 → 16 kHz Int16 LE) and the data sizes (raw bytes, Base64 bytes, final JSON payload size) all appear correct; the conversion path is sketched after this list.
- Headers: Ensured the correct `Authorization` and `OpenAI-Beta` headers are used. Tried adding/removing `Sec-WebSocket-Protocol` (removing it fixed an initial beta-header error).
- API Key/Usage: Confirmed the API key is valid, active, and well below usage limits.
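
For the audio data verification step, this is roughly the conversion path in my tap (simplified; the frame counts match the log above):

```swift
import AVFoundation

let engine = AVAudioEngine()
let inputFormat = engine.inputNode.outputFormat(forBus: 0)      // 48 kHz Float32 on my device
let targetFormat = AVAudioFormat(commonFormat: .pcmFormatInt16, // what "pcm16" expects
                                 sampleRate: 16_000,
                                 channels: 1,
                                 interleaved: true)!
let converter = AVAudioConverter(from: inputFormat, to: targetFormat)!

engine.inputNode.installTap(onBus: 0, bufferSize: 4800, format: inputFormat) { buffer, _ in
    // 4800 frames at 48 kHz -> 1600 frames at 16 kHz, matching the log.
    let capacity = AVAudioFrameCount(Double(buffer.frameLength)
        * targetFormat.sampleRate / inputFormat.sampleRate)
    let outBuffer = AVAudioPCMBuffer(pcmFormat: targetFormat, frameCapacity: capacity)!
    var supplied = false
    converter.convert(to: outBuffer, error: nil) { _, outStatus in
        // Hand the converter the tap buffer exactly once per callback.
        if supplied { outStatus.pointee = .noDataNow; return nil }
        supplied = true
        outStatus.pointee = .haveData
        return buffer
    }
    // outBuffer now holds 1600 Int16 frames (3200 bytes after flattening).
}
// (AVAudioSession setup and engine.start() omitted; see "Audio engine started" in the log.)
```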
Observations & Related Issues:
- The connection remains stable indefinitely if no client messages are sent after
session.created
. The closure is strictly triggered by the firstinput_audio_buffer.append
(orsession.update
). - This behavior mirrors issues reported in this thread: [Constantly disconnecting after session update with Realtime API], particularly
louzell
’s observation about direct client connections failing while proxied ones work, andaidanallchin
’s suggestion about timing/readiness after connection.
**Questions:**
- Is this a known issue or limitation when using `URLSessionWebSocketTask` directly from iOS with the Realtime API?
- Is there a specific, undocumented state the server needs to be in after `session.created` before it can accept `input_audio_buffer.append` events without closing the connection?
- Could there be a very subtle byte-level encoding issue with the initial PCM data or Base64 encoding from Swift's `Data` methods that the server rejects? (My flattening helper is sketched after this list.)
- Are there any other non-obvious configuration steps or messages required when using WebSockets for audio streaming?
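
On that third question: this is how the converted buffer is flattened to `Data` before Base64 encoding. iOS devices are little-endian ARM, so the raw `Int16` memory should already be in LE byte order, but I'd welcome a second pair of eyes:

```swift
import AVFoundation

// Copies the interleaved Int16 samples straight out of the converted buffer.
func pcmBufferToData(_ buffer: AVAudioPCMBuffer) -> Data {
    guard let channel = buffer.int16ChannelData?[0] else { return Data() }
    let byteCount = Int(buffer.frameLength) * MemoryLayout<Int16>.size
    return Data(bytes: channel, count: byteCount) // 1600 frames -> 3200 bytes
}
```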
Any insights or suggestions would be greatly appreciated! The fallback to a simulation mode works, but I’d really like to get the live connection stable.
Thanks!