I’ve created a raw WebSocket setup with Node.js (system prompt ~1,500 characters) and this definitely does not feel REALTIME. Tested it with text only, and responses are coming back at around 350–800 ms. Is everyone experiencing the same here, or is it my setup? I’m in a US West region, btw.
Best latency is ~250 ms when the prompt is ~150 characters (which is not really useful).
It feels very realtime to me and I’m based in South Africa. Far away from any OpenAI servers. It’s so realtime that I’ve actually throttled back the VAD so that it doesn’t interrupt me too soon, as I’m still thinking.
"turn_detection": {
"type": "server_vad",
"threshold": 0.6,
"prefix_padding_ms": 500,
"silence_duration_ms": 2000,
},
[11:39:18.864] Assistant Event: session.created
[11:39:19.079] Assistant Event: session.updated
[11:39:19.229] Assistant Event: conversation.item.created
[11:39:19.232] Assistant Event: response.created
[11:39:19.600] Assistant Event: rate_limits.updated
[11:39:19.602] Assistant Event: response.output_item.added
[11:39:19.604] Assistant Event: conversation.item.created
[11:39:19.619] Assistant Event: response.content_part.added
[11:39:19.620] Assistant Event: response.audio_transcript.delta
[11:39:19.621] Assistant Transcription: general
[11:39:19.721] Assistant Event: response.audio.delta
[11:39:19.732] Audio delta written
[11:39:19.779] Assistant Event: response.audio.delta
[11:39:19.780] Audio delta written
[11:39:19.799] Assistant Event: response.audio.delta
[11:39:19.800] Audio delta written
[11:39:19.819] Assistant Event: response.audio.delta
[11:39:19.820] Audio delta written
[11:39:19.888] Assistant Event: response.audio.delta
[11:39:19.889] Audio delta written
[11:39:19.914] Assistant Event: response.audio.delta
[11:39:19.915] Audio delta written
[11:39:19.919] Assistant Event: response.audio.done
[11:39:19.920] Assistant Event: response.audio_transcript.done
[11:39:19.930] Assistant Event: response.content_part.done
[11:39:19.932] Assistant Event: response.output_item.done
[11:39:19.933] Assistant Event: response.done
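Reading the log above, the gap that matters for perceived latency is `response.created` (19.232) to the first `response.audio.delta` (19.721), roughly 489 ms. A small helper to compute that directly from such log lines (the `[HH:MM:SS.mmm]` timestamp format is taken from the log as printed):

```javascript
// Convert a "[HH:MM:SS.mmm]" log prefix into milliseconds since midnight,
// so we can diff two event lines from the log above.
function logTsToMs(line) {
  const m = line.match(/\[(\d{2}):(\d{2}):(\d{2})\.(\d{3})\]/);
  const [, h, min, s, ms] = m.map(Number); // m[0] is the full match, skipped
  return ((h * 60 + min) * 60 + s) * 1000 + ms;
}

const created = logTsToMs("[11:39:19.232] Assistant Event: response.created");
const firstAudio = logTsToMs("[11:39:19.721] Assistant Event: response.audio.delta");
console.log(firstAudio - created); // 489 (ms from response.created to first audio)
```

That ~489 ms time-to-first-audio is what the 350–800 ms figures in this thread are measuring; the remaining deltas then stream in every ~20–70 ms.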
Speech feels different. Humans are slow and expect an assistant to be at roughly their pace. Those numbers reflect actual event timings, which clearly show how soon data flows back from the server.
When text processing is tied up with other sequential tasks, 100 ms makes a big difference. So when it’s averaging above 350 ms, that’s considered slow. This API is more on the soft-realtime side, and a bit non-realtime for convoluted requests.