Hi! I’m currently working on a project where there’s a need to have the client (browser) connecting through WebRTC to RealtimeAPI and the backend to connect with WebSockets using the same sessionId to perform operations as the user talks in the client.
Is this something that can be done? I already managed to connect the client with WebRTC and the server with WS to the same sessionId as the client but the WS never gets events for “message” rather than the “session.created” one.
I mean an endpoint in the cloud that takes a sessionId param and stablishes a WS connection with that sessionId with Realtime.
So, the user will have the conversation in the browswer through WebRTC and the server in the cloud will be performing tasks according to the text transcripts from the WS connection to the same session.
I think you mean a relay server? This is absolutely possible. I’ve done this with my project (Though it is WSS (Client) - WSS (Relay Server) - WSS (OpenAI), I don’t use WebRTC but it should be possible as well).
You just pass all the raw base64 encoded audio frames to the relay server which then sends it to the OpenAI websocket.
I think this is about listening to the events from a connection between the user and the Realtime API service from the backend. This makes sense because, if frontend has the session token and with this token they can update the settings of the object created from backend, then they could abuse it without backend noticing it, no?
Agreed. A relay server would be the only option. You’d need a “dummy” client that sits in the WebRTC channel to collect and distribute all the packets of the data.
There’s something I don’t understand, if the server can never listen to the events of a session it created, wouldn’t this open vulnerabilities? For example:
Client requests a session_id from server
Server uses API key to create the session with a set of parameters
Gives it back to client
Client uses this session_id to abuse the session (e.g. increases the max tokens allowed and used model)
Server cannot know this because there’s no way to listen to the session
As far as I understand it, it technically would be listening.
All the information would be passed from the dummy client in the WebRTC channel to the WebSocket server. I’m not sure how plausible this whole scenario is (latency wise), but it’s an option
If there’s this dummy RTC, it’d be on backend and frontend would talk to this, no?
It would double the latency in theory. I don’t know if this is something that’s noticeable by the user, but it is twice as many events.
I feel like the obvious way to do this is to let frontend have a direct connection with the Realtime API with the session token, but restricting it from changing the settings or at least letting backend know if the session has been updated.
I don’t understand the point of having a session token pattern if we still have to connect backend directly, we could just use the API Auth token. That’s why I think I might be missing something.
There is no traditional “back-end” for WebRTC besides usually server(s) that facilitate connection and sometimes media transfer. That’s the purpose of it: direct p2p communications. The OpenAI libraries (AFAIK) abstract away all the complexities of handling these protocols.
I’m approaching this question purely from a WebRTC → Websocket perspective, and not including the libraries that make it easier to manage, maybe that’s where the disconnect is happening?
I’m gathering that your reference to the “back-end” is the OpenAI server.
There is a expiration time associated with the ephermeral key. This means the user cannot run-away (forever) with examining the source in browser.
The fact that one cannot read whether the session has been updated or not kind of sucks. For the moment then, you can offer this service to someone you trust. After some time (weeks/months), I am relatively sure that OpenAI will give read level api access
I understand this, but traditional backend creates the session token (ephemeral key) and serves it to the frontend to connect to OpenAI’s WebRTC.
Sorry if I was not being clear. The way that I picture a real service using this is something like
User wants to have a voice conversation in frontend
Backend calls OpenAI’s POST and gets the sessionID with some settings
Hands it back to the frontend
Frontend connects directly to OpenAI’s Realtime API using the session token to establish a WebRTC connection
Then, the traditional backend that owns the OpenAI’s auth key doesn’t know how frontend is using this. I get it for privacy reasons, but I’d expect for backend to be able to tell if the settings of the session object are being changed, which is possible according to the docs. They could change things like that max output tokens, the instructions, model, and some other stuff. Is this less sensitive than I think?