RealtimeAPI: WebRTC (Client) + WebSocket (Server) possible?

Hi! I’m currently working on a project where the client (browser) connects to the Realtime API over WebRTC, while the backend connects over WebSockets using the same sessionId to perform operations as the user talks in the client.

Is this something that can be done? I already managed to connect the client over WebRTC and the server over WS to the same sessionId as the client, but the WS never receives any events other than the “session.created” one.

When you say server do you mean signaling server? Or whatever is hosting the AI as a client?

Hey mat.eo!

I mean an endpoint in the cloud that takes a sessionId param and establishes a WS connection to Realtime with that sessionId.

So, the user will have the conversation in the browser through WebRTC, and the server in the cloud will perform tasks based on the text transcripts coming over the WS connection to the same session.

I think you mean a relay server? This is absolutely possible. I’ve done this in my project (though mine is WSS (Client) - WSS (Relay Server) - WSS (OpenAI); I don’t use WebRTC, but it should be possible as well).
You just pass all the raw base64-encoded audio frames to the relay server, which then sends them to the OpenAI WebSocket.
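
A minimal sketch of the relay’s forwarding logic in Node, assuming the Realtime API’s documented client event name `input_audio_buffer.append`; the actual WebSocket plumbing (connection setup, auth headers) is omitted:

```javascript
// Hedged sketch: package a raw PCM16 audio chunk coming from the browser
// client into the JSON event the relay forwards to the OpenAI WebSocket.
// "input_audio_buffer.append" is the Realtime API's client event for audio.
function toAudioAppendEvent(pcmChunk) {
  return JSON.stringify({
    type: "input_audio_buffer.append",
    audio: Buffer.from(pcmChunk).toString("base64"),
  });
}

// Going the other way, the relay just passes server events through
// untouched, so the client (and any backend listener) sees every event.
function forwardServerEvent(rawMessage, clientSocket) {
  clientSocket.send(rawMessage);
}
```

Because the relay sits in the middle of both directions, the backend can inspect transcripts (and any `session.update` attempts) as they pass through.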

Cheers! :hugs:

I think this is about listening, from the backend, to the events of a connection between the user and the Realtime API service. This makes sense because, if the frontend has the session token and can use it to update the settings of the session object created by the backend, then it could abuse it without the backend noticing, no?

Agreed. A relay server would be the only option. You’d need a “dummy” client that sits in the WebRTC channel to collect and distribute all the data packets.

There’s something I don’t understand: if the server can never listen to the events of a session it created, wouldn’t this open vulnerabilities? For example:

  1. Client requests a session_id from server
  2. Server uses API key to create the session with a set of parameters
  3. Gives it back to client
  4. Client uses this session_id to abuse the session (e.g. increases the max tokens allowed or changes the model)
  5. Server cannot know this because there’s no way to listen to the session
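
For concreteness, step 4 above would just be the client emitting a `session.update` event over its own connection. A sketch of what that looks like (`session.update` and `max_response_output_tokens` follow the Realtime API docs; the particular values are hypothetical):

```javascript
// Hedged sketch: the kind of client-side event the backend never sees.
// "session.update" is the Realtime API's client event for changing
// session settings mid-conversation.
function makeSessionUpdate(overrides) {
  return JSON.stringify({
    type: "session.update",
    session: overrides,
  });
}

// e.g. bumping the response token cap beyond what the backend configured:
const abusive = makeSessionUpdate({ max_response_output_tokens: 4096 });
```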

I must be missing something :see_no_evil:

As far as I understand it, it technically would be listening.

All the information would be passed from the dummy client in the WebRTC channel to the WebSocket server. I’m not sure how plausible this whole scenario is (latency-wise), but it’s an option.

If there’s this dummy RTC client, it’d be on the backend and the frontend would talk to it, no?

It would double the latency in theory. I don’t know whether that’s noticeable to the user, but it is twice as many events.

I feel like the obvious way to do this is to let the frontend have a direct connection to the Realtime API with the session token, but restrict it from changing the settings, or at least let the backend know when the session has been updated.

I don’t understand the point of having a session-token pattern if we still have to connect the backend directly; we could just use the API auth token. That’s why I think I might be missing something.

There is no traditional “back-end” for WebRTC besides, usually, the server(s) that facilitate the connection and sometimes media transfer. That’s the purpose of it: direct p2p communication. The OpenAI libraries (AFAIK) abstract away all the complexities of handling these protocols.

I’m approaching this question purely from a WebRTC → Websocket perspective, and not including the libraries that make it easier to manage, maybe that’s where the disconnect is happening?

I’m gathering that your reference to the “back-end” is the OpenAI server.

There is an expiration time associated with the ephemeral key. This means a user who extracts the key by examining the source in the browser cannot run away with it forever.

The fact that one cannot read whether the session has been updated or not kind of sucks. For the moment, then, you can only offer this service to someone you trust. After some time (weeks/months), I am relatively sure that OpenAI will add read-level API access.

There is no traditional “back-end” for WebRTC

I understand this, but the traditional backend creates the session token (ephemeral key) and serves it to the frontend, which uses it to connect to OpenAI over WebRTC.

Sorry if I was not being clear. The way I picture a real service using this is something like:

  1. User wants to have a voice conversation in frontend
  2. Backend calls OpenAI’s POST and gets the sessionID with some settings
  3. Hands it back to the frontend
  4. Frontend connects directly to OpenAI’s Realtime API using the session token to establish a WebRTC connection
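
Steps 2–3 above can be sketched as a request builder (the `/v1/realtime/sessions` endpoint and the `client_secret` field follow the Realtime API docs at the time of writing, but treat the exact shapes, including the model name, as assumptions to verify):

```javascript
// Hedged sketch: build the backend's session-creation request.
// POST /v1/realtime/sessions is expected to return a session object
// containing an ephemeral client_secret for the frontend's WebRTC handshake.
function buildSessionRequest(apiKey) {
  return {
    url: "https://api.openai.com/v1/realtime/sessions",
    options: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiKey}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({
        model: "gpt-4o-realtime-preview", // model name is an assumption
        voice: "verse",
      }),
    },
  };
}

// Usage (untested network call):
//   const { url, options } = buildSessionRequest(process.env.OPENAI_API_KEY);
//   const session = await fetch(url, options).then((r) => r.json());
//   // hand session.client_secret.value to the frontend for step 4
```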

Then, the traditional backend that owns OpenAI’s auth key doesn’t know how the frontend is using it. I get it for privacy reasons, but I’d expect the backend to be able to tell when the settings of the session object are changed, which the frontend can do according to the docs. It could change things like the max output tokens, the instructions, the model, and some other stuff. Is this less sensitive than I think?

Yeah, that’s my read as well. Thanks for taking the time to read and reply : )