If you are going to use an ephemeral token, the POST that creates the session can include the initial “developer message” (in this case, just the “instructions” field), as shown in the docs:
curl -X POST https://api.openai.com/v1/realtime/sessions \
-H "Authorization: Bearer $OPENAI_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"model": "gpt-4o-realtime-preview-2024-12-17",
"modalities": ["audio", "text"],
"instructions": "You are a friendly assistant."
}'
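The same session-minting request can be made from a trusted backend in Python. A rough sketch using the requests package, assuming the response contains a client_secret object whose value field is the ephemeral token:

import os
import requests

# Mint an ephemeral Realtime session from a trusted backend (sketch).
resp = requests.post(
    "https://api.openai.com/v1/realtime/sessions",
    headers={
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "Content-Type": "application/json",
    },
    json={
        "model": "gpt-4o-realtime-preview-2024-12-17",
        "modalities": ["audio", "text"],
        "instructions": "You are a friendly assistant.",
    },
)
resp.raise_for_status()
session = resp.json()

# Hand this short-lived token to the browser/client; keep the real API key server-side.
ephemeral_token = session["client_secret"]["value"]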
Otherwise, the instructions can be sent in the initial session.update event after you connect.
Set the instructions to something whose effect is immediately apparent, such as “Jacobo the Pirate ye be, only responding in pirate brogue”, so you can confirm they are actually being applied.
{
  "session": {
    "modalities": ["text", "audio"],
    "instructions": "You are a pirate, arr!",
    "voice": "sage",
    "input_audio_format": "pcm16",
    "output_audio_format": "pcm16",
    "input_audio_transcription": {
      "model": "whisper-1"
    },
    ...
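When using the Python SDK (described below), the same configuration can be applied after connecting by passing that object to connection.session.update. A minimal sketch, assuming a connection has already been opened as shown later in this guide:

# Sketch: apply the same session configuration over an open SDK connection.
connection.session.update(session={
    "modalities": ["text", "audio"],
    "instructions": "You are a pirate, arr!",
    "voice": "sage",
    "input_audio_format": "pcm16",
    "output_audio_format": "pcm16",
    "input_audio_transcription": {"model": "whisper-1"},
})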
Below is the missing getting-started guide and reference for the OpenAI Realtime API, a low-latency, stateful, event-based WebSocket API, as used through the Python SDK methods. It is based on analysis of the SDK's realtime and session Python code, not on any documentation already in existence.
Overview
The Realtime API enables you to:
- Stream audio or text to the model and receive responses in real time.
- Handle function calling as part of the conversation (e.g. retrieving information from a function).
- Transcribe user speech (via optional Whisper-based transcription) while still sending raw audio to the model in parallel.
- Control conversation flow by sending or deleting items from the conversation history, manually committing your audio buffer, etc.
- Use voice activity detection (VAD) modes (server or client-driven) for more natural turn-taking in voice conversations.
You communicate over a WebSocket connection. The API session is essentially a single conversation “context” that persists while you keep the socket open. During the session:
- You send “client events” to the server, instructing it what to do (e.g. “append these audio bytes”, “commit them as user speech”, “create a new response from the model,” “delete an item from conversation history,” “change the session configuration,” etc.).
- The server sends “server events” back, indicating what happened or providing actual model output (assistant text, streamed audio, function calls, etc.).
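On the wire, both directions are small JSON objects distinguished by a type field. As an illustration (shapes abbreviated, field values made up), the typed SDK call connection.response.create() corresponds roughly to the first object below, and the reply comes back as server events such as response.done:

# Client event sent over the socket when you call connection.response.create()
client_event = {"type": "response.create"}

# Abbreviated, made-up example of a server event you might read back
server_event = {
    "type": "response.done",
    "response": {"id": "resp_abc", "status": "completed", "output": ["..."]},
}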
Key Concepts
- Session
  A session is your conversation context. It is ephemeral but can be configured with instructions (like a system prompt), voice settings, transcription, etc.
  - You typically call client.beta.realtime.sessions.create(...) to create such a session and optionally obtain an ephemeral API token for client-side usage (e.g. from a browser).
  - Once you have a session, you can connect to the Realtime API over WebSocket by calling client.beta.realtime.connect(...).
- WebSocket Connection
  Once you connect using realtime.connect(...), you get back a connection manager (RealtimeConnectionManager), which in a with block yields a RealtimeConnection. From that object, you can:
  - Send events (e.g. connection.session.update(...), connection.conversation.item.create(...), etc.).
  - Receive events via connection.recv() or by iterating over the connection in a loop.
- Events
  - Client Events are Python calls like connection.response.create(...), each generating a JSON event that is sent over the socket.
  - Server Events come back from the server, e.g. session.created, response.created, conversation.item.created, etc.
  - The code transforms these events from JSON into strongly typed Python objects (RealtimeServerEvent or typed sub-objects).
- Text vs. Audio
  - The Realtime API supports text-based chat messages and audio-based messages (plus function calls and tool usage).
  - When using voice, you can either rely on “Server VAD” mode, where the server automatically decides when the user speech ends and produces a response, or “client-driven” mode, where you manually decide when to commit the audio buffer.
- Audio Buffer
  - For voice input, you stream bytes of audio to connection.input_audio_buffer.append(audio=...).
  - The model can be configured to transcribe them automatically.
  - If you are not using server-side VAD, you must explicitly call connection.input_audio_buffer.commit() so the server knows it has a chunk of user speech to interpret and incorporate into the conversation.
- Async vs. Sync
  - The code examples show both synchronous usage (RealtimeConnection + RealtimeConnectionManager) and asynchronous usage (AsyncRealtimeConnection + AsyncRealtimeConnectionManager).
  - They expose the same event-based interface but differ in how you iterate over incoming server events or do concurrency in your application.
Typical Flow: Text-Only Chat
Here’s a minimal example of how you might use text-based Realtime in synchronous Python. (Async usage is almost identical, just with async/await.)
import openai

# 1. Create or configure your OpenAI client (assuming you have an API key).
client = openai.OpenAI(api_key="YOUR_API_KEY")

# 2. Optionally create a session with some configuration if you want:
session = client.beta.realtime.sessions.create(
    model="gpt-4o-realtime-preview",
    instructions="You are a friendly assistant. Respond with short answers.",
    modalities=["text"],  # We only want text responses for this example
)
# session.client_secret can be used if you need an ephemeral token for the client side.

# 3. Open a Realtime WebSocket connection
with client.beta.realtime.connect(model="gpt-4o-realtime-preview") as connection:
    # 4. (Optional) Immediately update session config if you want
    connection.session.update(session={
        "instructions": "Please respond in very short phrases.",
        # other fields you want to override, like temperature, etc.
    })

    # 5. Send a user message into the conversation
    connection.conversation.item.create(item={
        "type": "message",
        "role": "user",
        "content": [{"type": "input_text", "text": "Hello! How are you?"}],
    })

    # 6. Instruct the model to create a response
    connection.response.create()

    # 7. Receive events. We'll read until we get the final "response.done" event.
    for event in connection:
        print(f"Received event: {event.type}")
        if event.type == "response.text.delta":
            # Streamed fragment of the assistant's text reply
            print(event.delta, end="", flush=True)
        if event.type == "response.done":
            # The model finished replying
            print()
            break
Explanation of the Steps
- Create the session with sessions.create(...).
  - We specify model and various optional parameters.
- Open the WebSocket using client.beta.realtime.connect(...).
  - This returns a context manager that yields a RealtimeConnection.
- (Optional) session.update() to override session parameters after creation.
- Add a conversation item representing the user’s new message:
  connection.conversation.item.create(item={
      "type": "message",
      "role": "user",
      "content": [{"type": "input_text", "text": "Some question"}],
  })
- Trigger a model response using connection.response.create(). If you are using server VAD, responses can be triggered automatically when your speech ends; otherwise you request one explicitly like this.
- Receive events by iterating over the connection. In a typical “pull” loop:
  for event in connection:
      # handle event
  The iteration will continue until the socket closes or you break out.
Typical Flow: Voice-to-Voice Chat with Server VAD
Below is an example scenario where you want to speak to the model, and the model replies in synthesized voice. You will:
- Create a session that supports audio input (input_audio_format) and audio output (modalities=["audio"]).
- Rely on the “Server VAD” so the server automatically detects when you’ve stopped speaking and triggers a response.
Synchronous Example
import base64
import openai

client = openai.OpenAI(api_key="YOUR_API_KEY")

# 1. Create a session with audio modalities, specifying input and output audio format
session = client.beta.realtime.sessions.create(
    model="gpt-4o-realtime-preview",
    modalities=["audio"],                   # We want audio in the conversation
    input_audio_format="pcm16",             # We will send 16-bit PCM
    output_audio_format="pcm16",            # We'll also receive 16-bit PCM from the model
    turn_detection={"type": "server_vad"},  # The server will do voice activity detection
    voice="alloy",                          # The voice for the model
    instructions="Speak in a calm, friendly voice.",
)

with client.beta.realtime.connect(model="gpt-4o-realtime-preview") as connection:
    # 2. Start streaming small chunks of 16-bit PCM audio as the user is speaking.
    #    Audio is sent base64-encoded.
    with open("user_input_audio.raw", "rb") as f:
        while chunk := f.read(1024):
            connection.input_audio_buffer.append(audio=base64.b64encode(chunk).decode("ascii"))

    # Because we have server VAD, we do NOT manually commit.
    # The server will auto-detect end of speech, create a user message,
    # and create a model response.

    # 3. Read events from the server
    for event in connection:
        if event.type == "response.audio.delta":
            # The model is responding with audio data; event.delta is base64-encoded PCM.
            # Here you can decode it and queue it to a speaker or save it to a file.
            print("Received assistant audio chunk!")
        elif event.type == "response.done":
            # The model has finished the response
            # break or keep listening for next user speech
            break
Client vs. Server VAD
- Server VAD (shown above): you simply keep sending audio, and the server decides when your speech ended. The server commits the buffer and triggers a response automatically.
- Client-driven VAD: you (the client) decide when the user is done speaking and call connection.input_audio_buffer.commit(). The model sees that as a user message. (One reason to do this is if you have custom logic or your own local VAD.) A minimal sketch follows below.
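A minimal sketch of that client-driven flow, assuming a connection is already open and server VAD has been disabled (e.g. turn_detection set to null in the session configuration):

import base64

# Client-driven turn-taking: the client decides when the utterance ends.
with open("user_input_audio.raw", "rb") as f:
    while chunk := f.read(1024):
        connection.input_audio_buffer.append(audio=base64.b64encode(chunk).decode("ascii"))

# Your own VAD / push-to-talk logic decides the user is done speaking:
connection.input_audio_buffer.commit()   # turns the buffered audio into a user message
connection.response.create()             # then explicitly request the model's reply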
Detailed Class Reference
Below is a summary of the main classes and their usage. You generally don’t instantiate these by hand; instead, you use the client.beta.realtime and client.beta.realtime.sessions entry points.
- client.beta.realtime.sessions:
  - .create(...): Returns a SessionCreateResponse, including a client_secret you can use for WebSocket authentication from untrusted clients (e.g., a web browser).
- client.beta.realtime.connect(...):
  - Returns a connection manager (RealtimeConnectionManager).
  - Used as a context manager (with ... as connection:).
  - Inside that context, you have a RealtimeConnection.
- RealtimeConnection (sync) or AsyncRealtimeConnection (async):
  - .session.update(...): Update default session configuration (e.g. instructions, temperature, etc.).
  - .response.create(...): Instruct the server to start generating a model response.
  - .response.cancel(...): Cancel an in-progress response.
  - .conversation.item.create(...): Create a new conversation item (user message, function call, etc.).
  - .conversation.item.delete(...): Remove an item from conversation history.
  - .conversation.item.truncate(...): Truncate a previous assistant audio message (useful if the user talked over it).
  - .input_audio_buffer.append(...): Append raw audio bytes to the “input buffer.”
  - .input_audio_buffer.commit(...): Commit that buffer as the user’s next utterance.
  - .input_audio_buffer.clear(...): Clear the buffer.
  - .recv() or for event in connection: Read the next server event.
- Server Events (examples):
  - session.created, session.updated
  - response.created / response.done / response.cancelled
  - conversation.item.created / conversation.item.deleted / conversation.item.truncated
  - input_audio_buffer.committed / input_audio_buffer.cleared
  - error
  Each event has typed fields (e.g. event.item, event.response_id, etc.).
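As one example of combining these calls, handling a user barge-in during an audio reply might look roughly like the sketch below. It assumes you track the id of the assistant item currently being played and how many milliseconds of it you have actually played back, and that truncate takes item_id, content_index and audio_end_ms (mirroring the underlying conversation.item.truncate event):

current_item_id = None   # track from conversation.item.created (assistant message items)
played_ms = 0            # milliseconds of assistant audio actually played back locally

for event in connection:
    if event.type == "conversation.item.created" and event.item.type == "message":
        if getattr(event.item, "role", None) == "assistant":
            current_item_id = event.item.id
    elif event.type == "input_audio_buffer.speech_started" and current_item_id:
        # The user started talking over the assistant (server VAD): stop generating...
        connection.response.cancel()
        # ...and trim the stored assistant item to what the user actually heard.
        connection.conversation.item.truncate(
            item_id=current_item_id,
            content_index=0,
            audio_end_ms=played_ms,
        )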
Asynchronous Usage
If your application is async (e.g., using asyncio), you can do:
import asyncio
import openai

async def main():
    client = openai.AsyncOpenAI(api_key="YOUR_API_KEY")

    session = await client.beta.realtime.sessions.create(
        model="gpt-4o-realtime-preview",
        modalities=["audio"],
        # ...
    )

    # This yields an AsyncRealtimeConnection in an async context
    async with client.beta.realtime.connect(model="gpt-4o-realtime-preview") as connection:
        # send events
        await connection.input_audio_buffer.append(audio="...")  # base64-encoded audio

        # read events asynchronously in a loop
        async for event in connection:
            if event.type == "response.audio.delta":
                # handle audio chunk (event.delta is base64-encoded), etc.
                pass
            if event.type == "response.done":
                break

asyncio.run(main())
All the same resource methods are available in async flavor: connection.session.update(...) becomes await connection.session.update(...), etc.
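The async connection is also convenient for running sending and receiving concurrently. A rough sketch, assuming a hypothetical get_mic_chunk() async generator that yields base64-encoded PCM chunks from your microphone:

import asyncio
import openai

async def stream_microphone(connection, get_mic_chunk):
    # get_mic_chunk is a hypothetical async generator of base64-encoded PCM chunks.
    async for chunk in get_mic_chunk():
        await connection.input_audio_buffer.append(audio=chunk)

async def read_events(connection):
    async for event in connection:
        if event.type == "response.audio.delta":
            ...  # queue event.delta (base64 audio) for playback
        elif event.type == "error":
            print("error:", event.error)

async def run(get_mic_chunk):
    client = openai.AsyncOpenAI(api_key="YOUR_API_KEY")
    async with client.beta.realtime.connect(model="gpt-4o-realtime-preview") as connection:
        # One task streams audio up while the other reads server events down.
        await asyncio.gather(
            stream_microphone(connection, get_mic_chunk),
            read_events(connection),
        )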
Handling Function Calls
The Realtime API also supports “tools” (equivalent to function calling in the standard Chat Completions API). You can pass an array of tools in the sessions.create(...) call to define which functions the assistant can invoke. The model may call those tools by producing a function call in the conversation history, which appears as a conversation.item.created event whose item has type function_call.
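A sketch of what such a tool definition might look like in the session (this assumes a flat name/description/parameters shape per tool and an illustrative get_weather function; check the exact schema against the current API reference):

session = client.beta.realtime.sessions.create(
    model="gpt-4o-realtime-preview",
    modalities=["text"],
    tools=[
        {
            "type": "function",
            "name": "get_weather",  # hypothetical example function
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        }
    ],
    tool_choice="auto",
)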
In your server events loop, you might see:
if event.type == "conversation.item.created" and event.item.type == "function_call":
function_name = event.item.function_call.name
# Execute that function, produce a result
result = run_function_locally(function_name, event.item.function_call.arguments)
# Then add a conversation item with the tool’s result
connection.conversation.item.create(item={
"type": "function_result",
"function_call": {
"name": function_name,
"arguments": str(result),
}
})
# Then if you'd like the model to respond to that, call
connection.response.create()
Error Handling
If you send an invalid event or invalid data, you will receive an error event from the server. You can catch this in the event loop:
for event in connection:
    if event.type == "error":
        print("Server reported error:", event.error)
        # handle or break
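Note that this covers errors the server reports over the socket. If the WebSocket itself drops, the SDK raises an ordinary Python exception instead; a rough reconnect wrapper might look like the following (the specific exception type depends on the underlying websocket library, so this sketch catches broadly and should be narrowed in real code):

import time
import openai

client = openai.OpenAI(api_key="YOUR_API_KEY")

def run_once():
    with client.beta.realtime.connect(model="gpt-4o-realtime-preview") as connection:
        for event in connection:
            if event.type == "error":
                print("Server reported error:", event.error)
            elif event.type == "response.done":
                break

while True:
    try:
        run_once()
        break
    except Exception as exc:  # network drop, handshake failure, etc.
        print("Connection lost, retrying in 2s:", exc)
        time.sleep(2)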
Summary and Next Steps
- Realtime API is a powerful WebSocket-based interface to build advanced, real-time chat or voice experiences.
- Synchronous and Asynchronous usage is supported: Realtime vs. AsyncRealtime.
- Session configuration controls how the model processes input and produces output (text/audio, voice style, system instructions, etc.).
- Event-based design: You push client events to drive the conversation, and you read back server events containing transcripts, audio data, function calls, and more.
- Voice-to-Voice is possible by sending raw PCM (or G.711) audio in real time and receiving the assistant’s audio output to play back. The server can handle turn detection automatically, or the client can do it manually.
Further reading and best practices would include:
- Managing session tokens in a secure environment (especially for ephemeral “client_secret” usage in a browser).
- Properly chunking audio streams so as not to exceed data limits or cause latency issues (a small pacing sketch follows below).
- Dealing with partially streamed audio outputs (e.g., buffer them for playback).
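On the chunking point, a rough pacing sketch, assuming 24 kHz, 16-bit mono PCM (so roughly 48,000 bytes per second) and an already-open connection:

import base64
import time

BYTES_PER_SECOND = 24_000 * 2          # 24 kHz * 2 bytes per 16-bit mono sample
CHUNK_MS = 100
CHUNK_BYTES = BYTES_PER_SECOND * CHUNK_MS // 1000

# Sending a pre-recorded file all at once works, but pacing it like a live
# microphone keeps server VAD timing realistic and avoids large bursts.
with open("user_input_audio.raw", "rb") as f:
    while chunk := f.read(CHUNK_BYTES):
        connection.input_audio_buffer.append(audio=base64.b64encode(chunk).decode("ascii"))
        time.sleep(CHUNK_MS / 1000)    # roughly real-time pacing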
With this guide, you should be able to initialize sessions, connect over WebSocket, and implement text or voice-based conversation experiences with the new Realtime API.