Hello,
I am currently developing a real-time OpenAI integration for a standalone Quest 3 project in Unreal Engine. I am using this plugin as a basis: OpenAI API Unreal. I have replaced the API with the latest Realtime API. I have the following problem: the response times of the Realtime API are sometimes very long. Does anyone have experience with this type of development and know which values need to be set so that the API responds quickly and reliably?
edit: this is if you DON'T have the UE source code (which you can get for free from Epic via GitHub, highly recommended; with source access you can take a shortcut and skip the following outdated advice)
For the Quest 3 part: since you're using the Realtime API, don't call OpenAI directly from the game client. The Quest 3 runs on a mobile chip, so round-tripping API calls from the headset is going to be sluggish and unpredictable. You need a relay server sitting between UE and OpenAI.
the flow looks like this: UE (Quest 3) → websocket → your server → OpenAI Realtime API → your server → UE
your server handles the openai connection (websocket or webrtc) and streams responses back to the game. that way the quest only has to maintain one fast persistent connection to your server instead of dealing with api auth, retries, and variable latency on device.
For the server, even a basic Node or Python WebSocket proxy on a cheap VPS works. If you need to scale later you can containerize it, but don't over-engineer it early.
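A minimal sketch of that relay in Python, assuming the `websockets` package (`pip install websockets`) and an `OPENAI_API_KEY` environment variable; the Realtime endpoint URL, model name, and `OpenAI-Beta` header follow OpenAI's published docs, but double-check them against the current API reference before relying on them:

```python
# One-file WebSocket relay: Quest client <-> this server <-> OpenAI Realtime API.
import asyncio
import os

UPSTREAM_MODEL = "gpt-4o-realtime-preview"  # assumed model name; check the docs

def upstream_url(model: str) -> str:
    """Realtime API WebSocket URL for a given model."""
    return f"wss://api.openai.com/v1/realtime?model={model}"

def upstream_headers(api_key: str) -> dict:
    """Auth headers the Realtime API expects."""
    return {
        "Authorization": f"Bearer {api_key}",
        "OpenAI-Beta": "realtime=v1",
    }

async def pump(src, dst):
    """Forward every frame from one socket to the other."""
    async for message in src:
        await dst.send(message)

async def handle_client(client):
    """One upstream OpenAI connection per connected game client."""
    import websockets  # imported lazily so the helpers above work without it

    async with websockets.connect(
        upstream_url(UPSTREAM_MODEL),
        # on websockets < 14 this keyword is `extra_headers` instead
        additional_headers=upstream_headers(os.environ["OPENAI_API_KEY"]),
    ) as upstream:
        # Relay both directions until either side closes.
        await asyncio.gather(pump(client, upstream), pump(upstream, client))

async def serve(host: str = "0.0.0.0", port: int = 8765):
    """The Quest connects here instead of to api.openai.com directly."""
    import websockets

    async with websockets.serve(handle_client, host, port):
        await asyncio.Future()  # run forever

# To run the relay: asyncio.run(serve())
```

The point of the split is that the headset keeps one persistent, low-variance connection to a box you control, and that box absorbs auth, reconnects, and OpenAI-side latency spikes.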
Also, make sure you're using WebSocket or WebRTC transport for the Realtime API, not REST polling; without a link that stays under ~50ms per hop you're always going to feel sluggish. OpenAI's time to first byte is around ~500ms under good conditions, so with a proper streaming setup you can get voice-to-voice latency down to ~800ms, which feels responsive in VR.
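To make that ~800ms figure concrete, here's a rough voice-to-voice budget; every stage timing is an illustrative assumption, not a measurement:

```python
# Back-of-envelope voice-to-voice latency budget for the streaming setup.
# All stage timings are illustrative assumptions, not measurements.
budget_ms = {
    "mic capture + speech endpointing on device": 100,
    "Quest -> relay server uplink": 40,
    "relay -> OpenAI, time to first audio byte": 500,
    "first audio chunk back through the relay": 60,
    "jitter/playback buffer on device": 100,
}

total_ms = sum(budget_ms.values())
for stage, ms in budget_ms.items():
    print(f"{stage}: {ms} ms")
print(f"estimated voice-to-voice: ~{total_ms} ms")  # ~800 ms
```

The time-to-first-byte line dominates, which is why shaving transport overhead (streaming instead of polling, a relay close to your users) matters: the network legs are the only part of the budget you control.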
If you're already doing that, then the part to change is the OpenAI hookup, because API → game is bad; API → server → orchestrator → game is good. The server relay pattern is what actually makes it usable. I was doing this last year for a similar XR project.
If you need any more info, just @ me, I've got all kinds of things. (I was also playing with Quest 3 PC control in the options because I thought it would be cool to have a VR me.)
Thank you for your help and tips. I am currently not using the source from Unreal, but from the Epic Games Launcher. In this case, what are the benefits of the source?
Currently the WebSocket is opened directly by the plugin, so no separate server is involved. As I am new to this topic, I have a few more questions for you. When I use the mobile Android app, the latency is relatively low, and a Quest 3 has more power than my Android phone, for example. How is this behaviour achieved?
Epic wouldn't ship this in the engine itself because that's risky for a general-purpose engine. But you, or me, or anyone with approval through their source-access program [it's free] can edit the source code. There are even people on their GitHub who may help you.
So why bake NATS and FAISS into the UE source at the floor level?

NATS: you put a message broker straight into the engine's networking layer. No HTTP, no extra WebSocket overhead. Sub-millisecond pub/sub between your AI servers, game clients, edge nodes, all of it. At the floor level, the engine is the messaging infra: your voice pipeline, telemetry, game state, one event bus, no middleware, no extra hops.

FAISS: vector search running inside the engine loop. No calling out to some external vector DB. Embeddings get queried locally at GPU speed, on device. NPC dialogue, spatial audio, RAG lookups, all happening on the same GPU pipeline as rendering. Zero network latency. On a Quest 3 or a phone this is huge.

Together: you offload the heavy AI inference to a server over NATS, get results back in microseconds, then FAISS grounds the responses locally with relevant context. The device just renders. The same process is how you would get it on your phone [the same architecture that makes the OpenAI mobile app feel instant], except now, with source access, the engine handles the orchestration, not some bolted-on plugin.
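To ground the FAISS part: a flat FAISS index is just brute-force nearest-neighbour search over your embeddings. Here's a NumPy stand-in for what `faiss.IndexFlatL2.search` computes (the dimension and vectors are made up for illustration):

```python
import numpy as np

# Toy embedding store standing in for a FAISS IndexFlatL2:
# rows would be embeddings of NPC dialogue lines / RAG chunks.
rng = np.random.default_rng(0)
d = 8                                     # toy embedding dimension
corpus = rng.normal(size=(100, d)).astype(np.float32)

def search(index: np.ndarray, query: np.ndarray, k: int):
    """Brute-force L2 search: the same result a flat FAISS index returns."""
    dists = np.sum((index - query) ** 2, axis=1)  # squared L2 to every row
    top = np.argsort(dists)[:k]                   # ids of the k nearest rows
    return dists[top], top

# Query with a slightly noisy copy of row 42; it should be the top hit.
query = corpus[42] + 0.01 * rng.normal(size=d).astype(np.float32)
dists, ids = search(corpus, query, k=3)
print(ids[0])  # 42
```

FAISS does exactly this, just heavily optimized and GPU-capable; the in-engine appeal is that the lookup costs a function call instead of a network round trip to an external vector DB.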
The benefit is much faster responsiveness overall. It's also the infrastructure to offload the GPU-heavy work and reintroduce the results, which the device simply renders.
You can take this even further when you teach a model Unreal Engine and train one. But you can cheat by simply using clever engineering.