I’m building a project that uses the OpenAI Realtime API over WebRTC, and it made me wonder about the security practices here.
Based on the official docs, the flow is the following:
UI → Server: request ephemeral key
Server → OpenAI: a request to https://api.openai.com/v1/realtime/sessions creates a session and returns the ephemeral key along with session parameters such as the ID
Server → UI: returns ephemeral key
UI → OpenAI: a request to https://api.openai.com/v1/realtime?model=${model} begins a session using the ephemeral key
UI → OpenAI via WebRTC: establishes a peer connection and proceeds with the conversation
Once the conversation is over, the UI closes the peer connection, which ends the Realtime session.
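For reference, the flow above can be sketched roughly as follows. This is a minimal sketch, not the official sample: the helper names (`buildSessionRequest`, `extractEphemeralKey`, `buildSdpRequest`) are hypothetical, and the response shape (`client_secret.value`) and the `application/sdp` content type are assumptions on my part that should be checked against the current docs.

```typescript
// Plain description of an HTTP request, so it can be passed to fetch(url, init).
interface HttpRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

// Server side: request that creates a session using the real API key.
function buildSessionRequest(apiKey: string, model: string): HttpRequest {
  return {
    url: "https://api.openai.com/v1/realtime/sessions",
    init: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${apiKey}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ model }),
    },
  };
}

// Server side: pull the ephemeral key out of the session response.
// ASSUMPTION: the key is returned under client_secret.value.
function extractEphemeralKey(session: { client_secret?: { value?: string } }): string {
  const key = session.client_secret?.value;
  if (!key) throw new Error("no ephemeral key in session response");
  return key;
}

// UI side: request that sends the WebRTC SDP offer, authorized with the
// ephemeral key only (the real API key never reaches the browser).
// ASSUMPTION: the offer is posted as application/sdp.
function buildSdpRequest(ephemeralKey: string, model: string, offerSdp: string): HttpRequest {
  return {
    url: `https://api.openai.com/v1/realtime?model=${encodeURIComponent(model)}`,
    init: {
      method: "POST",
      headers: {
        Authorization: `Bearer ${ephemeralKey}`,
        "Content-Type": "application/sdp",
      },
      body: offerSdp,
    },
  };
}
```

Each returned object can be handed to `fetch(req.url, req.init)`. The point to notice is that after the key is handed to the UI, every later request goes from the UI straight to OpenAI, so the Server is no longer in the loop.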
There are a few things that worry me:
There is no way for the Server to stop the conversation by sending a request with the session ID to OpenAI. This would be useful for restricting the duration of the conversation server-side (let’s say, based on the user’s “credits”), since client-side restrictions are easy to bypass.
There is no way for the Server to get the session metadata (duration, costs, status, etc.) by session ID. This would help prevent misuse, which is easy to achieve by sniffing the ephemeral key that comes from the Server and using it outside the UI app. Also, if the service charges its users based on conversation duration or tokens used, there is no way to learn the exact numbers for a particular session, which opens the door to misuse.
I understand that this can be solved by using WebSockets on the server side, but that brings its own implementation complexity and additional network traffic costs: cloud providers charge for inbound and outbound traffic, and with audio the cost is quite noticeable.
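To make the first concern concrete, here is a minimal sketch of the bookkeeping the Server can do on its own when it issues an ephemeral key. Everything in it is hypothetical (the `SECONDS_PER_CREDIT` rate, the `IssuedSession` shape, the function names). It only tells the Server when a session should have ended; it does not by itself stop anything, which is exactly the gap described above.

```typescript
// Hypothetical pricing: how many seconds of conversation one credit buys.
const SECONDS_PER_CREDIT = 60;

// What the Server remembers about each ephemeral key it hands out.
interface IssuedSession {
  userId: string;
  issuedAtMs: number;
  allowedSeconds: number;
}

// Convert a credit balance into a whole number of allowed seconds.
function allowedSecondsFor(credits: number): number {
  return Math.max(0, Math.floor(credits * SECONDS_PER_CREDIT));
}

// Record a session at key-issuing time; refuse if the user has no budget.
function issueSession(userId: string, credits: number, nowMs: number): IssuedSession | null {
  const allowedSeconds = allowedSecondsFor(credits);
  if (allowedSeconds === 0) return null;
  return { userId, issuedAtMs: nowMs, allowedSeconds };
}

// True once the session has used up its budget, measured from key issuance.
function isOverBudget(session: IssuedSession, nowMs: number): boolean {
  return nowMs - session.issuedAtMs >= session.allowedSeconds * 1000;
}
```

Note that the clock here starts when the key is issued, not when audio starts flowing, because the Server has no visibility into the actual session.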
A question to the fellow developers and builders out there: how do you handle this?
Also, a question for the OpenAI team: are there any plans to extend the Session API with endpoints to:
stop the session forcefully
get session status and metadata (e.g., costs and duration)?
OpenAI has also introduced a usage webhook to track the active session through a WebSocket connection from the server.
When the client establishes a WebRTC connection after the ephemeral key has been generated, the response to the WebRTC request includes a Location header that contains the call ID. The client can send that call ID to the business server, which can then open a WebSocket connection associated with that call ID to listen to the active session and track its usage.
This may help https://platform.openai.com/docs/guides/realtime-server-controls
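A minimal sketch of the call ID handoff described above. The function names are hypothetical, and two details are assumptions to verify against the linked guide: that the call ID is the last path segment of the Location header, and that the server-side WebSocket URL takes it as a `call_id` query parameter.

```typescript
// Client side: extract the call ID from the Location header of the SDP response.
// ASSUMPTION: the header looks like ".../calls/<callId>", so the ID is the
// last non-empty path segment.
function callIdFromLocation(location: string | null): string | null {
  if (!location) return null;
  const segments = location
    .replace(/\?.*$/, "") // drop any query string
    .split("/")
    .filter((s) => s.length > 0);
  return segments.pop() ?? null;
}

// Server side: URL for a WebSocket attached to the same call.
// ASSUMPTION: the endpoint accepts the ID as a call_id query parameter.
function sidebandUrl(callId: string): string {
  return `wss://api.openai.com/v1/realtime?call_id=${encodeURIComponent(callId)}`;
}
```

In the browser this would be used as `callIdFromLocation(sdpResponse.headers.get("Location"))`, with the result posted to your own backend. The backend then opens the WebSocket with its real API key in the Authorization header, so the key still never reaches the client.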