I’m curious about the maximum number of parallel real-time (streaming) sessions that can be run using a single OpenAI API key.
Has anyone had practical experience with this—specifically, how many model instances can be safely run simultaneously on one key before hitting limits (rate limits, 429 errors, etc.)?
Then calculate it. (You don’t even indicate which model or modality you’re considering.)
Make some test API calls and observe the token consumption of an interactive session.
The rate limiter will kick in when it has a chance to inspect the tokens of a new request, though I’m not sure whether each “create” trigger within a realtime session is inspected and counted for blocking. The realtime audio models have a smaller context window (32k or 16k tokens), so you can see relatively high costs and recurring token consumption per model response, but not unbounded costs.
Consider the worst case: someone saying “hello” over and over to a session with a full chat context, resubmitting many of those 32k tokens every minute. At $0.032 per 1K input tokens for gpt-realtime, that’s about a dollar per response.
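To make that arithmetic concrete, here is the worst case as a quick Python sketch. The per-1K-token price and the turns-per-minute figure are the assumptions from the post above, not authoritative pricing:

```python
# Back-of-envelope worst case: a full 32k-token context resubmitted on
# every turn, at an assumed $0.032 per 1K input tokens for gpt-realtime.
CONTEXT_TOKENS = 32_000
PRICE_PER_1K = 0.032  # USD per 1K input tokens (assumed)

cost_per_response = CONTEXT_TOKENS / 1_000 * PRICE_PER_1K
print(f"${cost_per_response:.2f} per response")  # roughly a dollar

# A chatty user triggering a response every few seconds:
RESPONSES_PER_MINUTE = 12  # illustrative assumption
cost_per_minute = cost_per_response * RESPONSES_PER_MINUTE
print(f"${cost_per_minute:.2f} per minute, worst case")
```

Multiply that by 200 concurrent sessions and the budget question matters more than the concurrency question.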
Sorry for not providing enough details earlier. We’re currently testing the gpt-realtime-mini model. The rate limits are documented, but there’s no information about parallel (concurrent) sessions.
I’m wondering whether I need to use multiple API keys to handle, for example, around 200 concurrent realtime sessions.
There is no published limit on concurrency, other than your own resources for handling and proxying audio streams, and OpenAI likely has a bigger cloud than you. Only oddities like “Assistants” had impractically low API call limits.
The constraint is not the authorization mechanism or the organization, as an API key doesn’t get “consumed.” Rather, it’s the ingress rate-limit workers and Cloudflare, where you have little control over whether load is distributed well or whether you get flagged as a DDoS attack, unless your connections come from different IP addresses (as in the WebRTC pattern, where an ephemeral key is handed over to the client).
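That WebRTC pattern can be sketched roughly as follows. The `/v1/realtime/sessions` endpoint, the request fields, and the `client_secret` response shape are assumptions based on the Realtime API docs and may differ for your account or model:

```python
import json
import urllib.request

# Sketch: mint a short-lived client secret on your server so the
# long-lived API key never reaches the browser. Each client then opens
# its own WebRTC connection from its own IP address.
SESSIONS_URL = "https://api.openai.com/v1/realtime/sessions"  # assumed endpoint

def build_session_request(api_key: str, model: str) -> urllib.request.Request:
    body = json.dumps({"model": model}).encode()
    return urllib.request.Request(
        SESSIONS_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

def mint_ephemeral_key(api_key: str, model: str = "gpt-realtime-mini") -> str:
    req = build_session_request(api_key, model)
    with urllib.request.urlopen(req) as resp:
        session = json.load(resp)
    # Hand this short-lived secret to the client; it expires quickly,
    # so a leak is far less damaging than leaking the API key itself.
    return session["client_secret"]["value"]
```

One server-side key mints many ephemeral keys, so 200 concurrent sessions do not imply 200 API keys.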
Some who have pushed the chat endpoints hard have found premature API limits before reaching a published rate of 500 requests per second (or bursting above it while intending to stay under the per-minute rate). But at a low tier, TPM and RPM limits will likely be your first concern.
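If TPM/RPM is the first wall you hit, the standard mitigation is exponential backoff with jitter on 429s. A minimal sketch, where `RateLimitError` is a stand-in for whatever 429 exception your client library raises:

```python
import random
import time

class RateLimitError(Exception):
    """Stand-in for the SDK's 429 / rate-limit exception."""

def with_backoff(call, max_attempts=5, base=0.5):
    """Run `call`, retrying on RateLimitError with exponential backoff
    plus a little jitter so concurrent workers don't retry in lockstep."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the 429 to the caller
            time.sleep(base * (2 ** attempt) + random.uniform(0, 0.1))
```

For 200 concurrent sessions you would also want a semaphore or queue in front of session creation, so a burst of joins doesn't trip the per-minute limit all at once.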