What confuses me is that the gpt-4o model can also be used for realtime conversation. So what is the relationship, and the difference, between gpt-4o and gpt-realtime?
Is gpt-realtime just a name for a group of realtime models?
gpt-4o is a model family (multimodal, general-purpose). gpt-realtime is a deployment / interface optimized for low-latency streaming, mainly for audio + interactive use cases.
Think of it this way:
gpt-4o → what the model can do
gpt-realtime → how the model is exposed for real-time interaction
Realtime APIs prioritize:
Persistent connections (WebRTC / WebSocket)
Token-by-token streaming
Audio I/O with very low latency
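To make the streaming point concrete, here is a minimal sketch of a client consuming a reply in fragments over a single open connection. It does not call any real service: `fake_server_events` stands in for the socket, and the event types `"delta"` and `"done"` are hypothetical names chosen for illustration, not an actual API schema.

```python
import json
from typing import Iterable, Iterator


def fake_server_events(text: str) -> Iterator[str]:
    """Stand-in for messages arriving on one persistent connection.

    The event types "delta" and "done" are hypothetical, not a real schema.
    """
    for word in text.split():
        yield json.dumps({"type": "delta", "text": word + " "})
    yield json.dumps({"type": "done"})


def consume(messages: Iterable[str]) -> str:
    """Handle each fragment as it arrives instead of waiting for the full reply."""
    parts = []
    for raw in messages:
        event = json.loads(raw)
        if event["type"] == "delta":
            # A real client would render text or play audio at this point.
            parts.append(event["text"])
        elif event["type"] == "done":
            break
    return "".join(parts).strip()


reply = consume(fake_server_events("hello from a streamed reply"))
print(reply)
```

The point of the design is that the client acts on every fragment immediately, which is what keeps perceived latency low in voice and live-interaction use cases.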
Under the hood, realtime endpoints may run variants of 4o-class models, but you don’t select them the same way you do in standard Responses calls.
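One way that difference can show up is in where the model name goes. The sketch below only builds the two request shapes and sends nothing. It assumes a standard call carries `model` in the JSON body, while a realtime session names it in the connection URL; the base URL and the `model` query parameter are assumptions to check against the current API documentation, and both helper functions are hypothetical.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit

# Assumed endpoint; verify against the current documentation before use.
REALTIME_BASE = "wss://api.openai.com/v1/realtime"


def standard_request_body(model: str, prompt: str) -> str:
    """Request/response style: the model is one field of the JSON payload."""
    return json.dumps({"model": model, "input": prompt})


def realtime_url(model: str, base: str = REALTIME_BASE) -> str:
    """Session style: the model is fixed when the connection is opened."""
    return f"{base}?{urlencode({'model': model})}"


body = standard_request_body("gpt-4o", "Hello")
url = realtime_url("gpt-realtime")

print(body)
print(url)
```

In the first case the model can change from one request to the next; in the second it is tied to the lifetime of the connection.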
So gpt-realtime isn’t a “group of models”. It’s a realtime-optimized serving layer designed for conversational agents, voice, and live interactions.