I want to know how to set the prompt for the Realtime API

I’m implementing it using the AsyncOpenAI client from the Python library.

I want to make an audio chatbot.

When I asked ChatGPT, it said I should pass instructions to connection.session.update, but it doesn’t actually work.

Instead of connection.session.update, I think I should set the prompt when creating the session, but I don’t know how to do it.

The code I’ve written so far is as follows.

from openai import AsyncOpenAI

async def main():
    client = AsyncOpenAI()

    async with client.beta.realtime.connect(model="gpt-4o-realtime-preview") as connection:
        await connection.session.update(session={
            'modalities': ['audio'],
            "instructions": "You are a voice transcription system. Listen to the input voice data and transcribe the spoken words into text."
        })

If you are going to use an ephemeral token, then the session you set up with a POST can include the initial “developer message”, which in this case is just the “instructions” field, as shown in the docs (a Python equivalent follows the curl example):

 curl -X POST https://api.openai.com/v1/realtime/sessions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-realtime-preview-2024-12-17",
    "modalities": ["audio", "text"],
    "instructions": "You are a friendly assistant."
  }'
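
If you’re on the Python SDK, roughly the same session creation can be done with client.beta.realtime.sessions.create (a sketch; the keyword arguments mirror the JSON fields above):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Sketch: create a Realtime session with initial instructions
session = client.beta.realtime.sessions.create(
    model="gpt-4o-realtime-preview-2024-12-17",
    modalities=["audio", "text"],
    instructions="You are a friendly assistant.",
)

# The ephemeral token to hand to the untrusted client:
print(session.client_secret.value)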

Otherwise, the session.update event you send right after connecting can carry the instructions, as in the JSON below (a Python sketch follows it).

You should set the instructions to something whose effect is immediately apparent, such as “Jacobo the Pirate ye be, only responding in pirate brogue”.

{
    "session": {
        "modalities": ["text", "audio"],
        "instructions": "You are a pirate, arr!.",
        "voice": "sage",
        "input_audio_format": "pcm16",
        "output_audio_format": "pcm16",
        "input_audio_transcription": {
            "model": "whisper-1"
        },...
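
In Python, sending that same configuration once the WebSocket is open would look something like this (a sketch with the async client, matching the snippet in the question):

async with client.beta.realtime.connect(model="gpt-4o-realtime-preview") as connection:
    await connection.session.update(session={
        "modalities": ["text", "audio"],
        "instructions": "You are a pirate, arr!",
        "voice": "sage",
        "input_audio_format": "pcm16",
        "output_audio_format": "pcm16",
        "input_audio_transcription": {"model": "whisper-1"},
    })
    # Wait for the server to confirm the new configuration
    async for event in connection:
        if event.type == "session.updated":
            print("Instructions are now active")
            break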

Below is a missing getting-started guide and reference for the OpenAI Realtime API using the Python SDK methods: a low-latency, stateful, event-based WebSocket API. It is based on analysis of the SDK’s realtime and session Python code, not on any documentation already in existence.


Overview

The Realtime API enables you to:

  • Stream audio or text to the model and receive responses in real time.
  • Handle function calling as part of the conversation (e.g. retrieving information from a function).
  • Transcribe user speech (via optional Whisper-based transcription) while still sending raw audio to the model in parallel.
  • Control conversation flow by sending or deleting items from the conversation history, manually committing your audio buffer, etc.
  • Use voice activity detection (VAD) modes (server or client-driven) for more natural turn-taking in voice conversations.

You communicate over a WebSocket connection. The API session is essentially a single conversation “context” that persists while you keep the socket open. During the session:

  1. You send “client events” to the server, instructing it what to do (e.g. “append these audio bytes”, “commit them as user speech”, “create a new response from the model,” “delete an item from conversation history,” “change the session configuration,” etc.).
  2. The server sends “server events” back, indicating what happened or providing actual model output (assistant text, streamed audio, function calls, etc.).

Key Concepts

  1. Session
    A session is your conversation context. It is ephemeral but can be configured with instructions (like a system prompt), voice settings, transcription, etc.

    • You typically call client.beta.realtime.sessions.create(...) to create such a session and optionally obtain an ephemeral API token for client-side usage (e.g. from a browser).
    • Once you have a session, you can connect to the Realtime API over WebSocket by calling client.beta.realtime.connect(...).
  2. WebSocket Connection
    Once you connect using realtime.connect(...), you get back a connection manager (RealtimeConnectionManager), which in a with block yields a RealtimeConnection. From that object, you can:

    • Send events (e.g. connection.session.update(...), connection.conversation.item.create(...), etc.).
    • Receive events via connection.recv() or by iterating over the connection in a loop (see the short sketch after this list).
  3. Events

    • Client Events are Python calls like connection.response.create(...), each generating a JSON event that is sent over the socket.
    • Server Events come back from the server, e.g. session.created, response.created, conversation.item.created, etc.
    • The code transforms these events from JSON into strongly typed Python objects (RealtimeServerEvent or typed sub-objects).
  4. Text vs. Audio

    • The Realtime API supports text-based chat messages and audio-based messages (plus function calls and tool usage).
    • When using voice, you can either rely on “Server VAD” mode—where the server automatically decides when the user speech ends to produce a response—or “client-driven” mode, where you manually decide when to commit the audio buffer.
  5. Audio Buffer

    • For voice input, you stream bytes of audio to connection.input_audio_buffer.append(audio=...).
    • The model can be configured to transcribe them automatically.
    • If you are not using server-side VAD, you must explicitly call connection.input_audio_buffer.commit() so the server knows it has a chunk of user speech to interpret and incorporate into the conversation.
  6. Async vs. Sync

    • The code examples show both synchronous usage (RealtimeConnection + RealtimeConnectionManager) and asynchronous usage (AsyncRealtimeConnection + AsyncRealtimeConnectionManager).
    • They expose the same event-based interface but differ in how you iterate over incoming server events or do concurrency in your application.
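
As a tiny illustration of the send/receive flow described in items 2, 3 and 5 (a sketch; assumes connection was already obtained from client.beta.realtime.connect(...)):

# Send a client event: change the session instructions mid-conversation
connection.session.update(session={"instructions": "Answer in one short sentence."})

# Read server events one at a time...
event = connection.recv()
print(event.type)  # e.g. "session.updated"

# ...or iterate until the socket closes (or you break out)
for event in connection:
    print(event.type)
    if event.type == "response.done":
        break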

Typical Flow: Text-Only Chat

Here’s a minimal example of how you might use text-based Realtime in synchronous Python. (Async usage is almost identical, just with async/await.)

import openai

# 1. Create or configure your OpenAI client (assuming you have an API key).
client = openai.OpenAI(api_key="YOUR_API_KEY")

# 2. Optionally create a session with some configuration if you want:
session = client.beta.realtime.sessions.create(
    model="gpt-4o-realtime-preview",
    instructions="You are a friendly assistant. Respond with short answers.",
    modalities=["text"],  # We only want text responses for this example
)
# session.client_secret can be used if you need an ephemeral token for the client side
# (note that connect() below opens its own session).

# 3. Open a Realtime WebSocket connection
with client.beta.realtime.connect(model="gpt-4o-realtime-preview") as connection:
    # 4. (Optional) Immediately update session config if you want
    connection.session.update(session={
        "instructions": "Please respond in very short phrases.",
        # other fields you want to override, like temperature, etc.
    })

    # 5. Send a user message into the conversation
    connection.conversation.item.create(item={
        "type": "user", 
        "text": "Hello! How are you?"
    })

    # 6. Instruct the model to create a response
    connection.response.create()

    # 7. Receive events. We'll read until we get the final "response.done" event.
    for event in connection:
        print(f"Received event: {event.type}")
        if event.type == "conversation.item.created":
            # Might contain the text or partial text from the assistant
            if event.item and event.item.type == "assistant" and event.item.text:
                print("Assistant says:", event.item.text)

        if event.type == "response.done":
            # The model finished replying
            break

Explanation of the Steps

  1. Create the session with sessions.create(...).
    • We specify model and various optional parameters.
  2. Open the WebSocket using client.beta.realtime.connect(...).
    • This returns a context manager that yields a RealtimeConnection.
  3. (Optional) session.update() to override session parameters after creation.
  4. Add a conversation item representing the user’s new message:
    connection.conversation.item.create(item={
        "type": "user", 
        "text": "Some question"
    })
    
  5. Trigger a model response using connection.response.create(). With server VAD enabled the server can create responses automatically after each detected user turn; otherwise you trigger them explicitly.
  6. Receive events by iterating over the connection. In a typical “pull” loop:
    for event in connection:
        # handle event
    
    The iteration will continue until the socket closes or you break out.

Typical Flow: Voice-to-Voice Chat with Server VAD

Below is an example scenario where you want to speak to the model, and the model replies in synthesized voice. You will:

  1. Create a session that supports audio input (input_audio_format) and audio output (modalities=["audio", "text"]).
  2. Rely on the “Server VAD” so the server automatically detects when you’ve stopped speaking and triggers a response.

Synchronous Example

import base64

import openai

client = openai.OpenAI(api_key="YOUR_API_KEY")

# 1. Create a session with audio modalities, specifying input and output audio format
session = client.beta.realtime.sessions.create(
    model="gpt-4o-realtime-preview",
    modalities=["audio"],             # We want audio in the conversation
    input_audio_format="pcm16",       # We will send 16-bit PCM
    output_audio_format="pcm16",      # We'll also receive 16-bit PCM from the model
    turn_detection={"type": "server_vad"},  # The model will do voice activity detection
    voice="alloy",                    # The voice for the model
    instructions="Speak in a calm, friendly voice."
)

with client.beta.realtime.connect(model="gpt-4o-realtime-preview") as connection:
    # 2. Start streaming small chunks of 16-bit PCM audio as the user is speaking.
    #    The append event expects base64-encoded audio rather than raw bytes.
    with open("user_input_audio.raw", "rb") as f:
        while chunk := f.read(1024):
            connection.input_audio_buffer.append(audio=base64.b64encode(chunk).decode("utf-8"))
    
    # Because we have server VAD, we do NOT manually commit. 
    # The server will auto-detect end of speech, create a user message, 
    # and create a model response.

    # 3. Read events from the server
    for event in connection:
        if event.type == "conversation.item.created":
            if event.item.type == "assistant" and event.item.audio:
                # The model is responding with audio data
                # event.item.audio might be base64-encoded. 
                # Here you can queue it to a speaker or save to file.
                print("Received assistant audio chunk!")
        elif event.type == "response.done":
            # The model has finished the response
            # break or keep listening for next user speech
            break

Client vs. Server VAD

  • Server VAD (shown above): you simply keep sending audio, and the server decides when your speech ended. The server commits the buffer and triggers a response automatically.
  • Client-driven VAD: you (the client) decide when the user is done speaking and call connection.input_audio_buffer.commit(); the server treats the committed buffer as the user’s message. (One reason to do this is if you have custom logic or your own local VAD.) See the sketch below.
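
A sketch of the client-driven flow (assumes server VAD was turned off by setting turn_detection to null in the session configuration, and that connection is already open):

import base64

# Stream the user's utterance into the input buffer
with open("user_input_audio.raw", "rb") as f:
    while chunk := f.read(1024):
        connection.input_audio_buffer.append(audio=base64.b64encode(chunk).decode("utf-8"))

# Your own VAD / push-to-talk logic decides the user is done speaking:
connection.input_audio_buffer.commit()   # the buffer becomes the next user message
connection.response.create()             # ask the model to reply to it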

Detailed Class Reference

Below is a summary of the main classes and their usage. You generally don’t instantiate these by hand; instead, you use the client.beta.realtime and client.beta.realtime.sessions entry points.

  1. client.beta.realtime.sessions:

    • .create(...): Returns a SessionCreateResponse, including a client_secret you can use for WebSocket authentication from untrusted clients (e.g., a web browser).
  2. client.beta.realtime.connect(...):

    • Returns a connection manager (RealtimeConnectionManager).
    • Used as a context manager (with ... as connection:).
    • Inside that context, you have a RealtimeConnection.
  3. RealtimeConnection (sync) or AsyncRealtimeConnection (async):

    • .session.update(...) — Update default session configuration (e.g. instructions, temperature, etc.).
    • .response.create(...) — Instruct the server to start generating a model response.
    • .response.cancel(...) — Cancel an in-progress response.
    • .conversation.item.create(...) — Create a new conversation item (user message, function call, etc.).
    • .conversation.item.delete(...) — Remove an item from conversation history.
    • .conversation.item.truncate(...) — Truncate a previous assistant audio message (useful if the user talked over it; see the sketch after this reference).
    • .input_audio_buffer.append(...) — Append a chunk of audio (base64-encoded) to the “input buffer.”
    • .input_audio_buffer.commit(...) — Commit that buffer as the user’s next utterance.
    • .input_audio_buffer.clear(...) — Clear the buffer.
    • .recv() or for event in connection: — read the next server event.
  4. Server Events (examples):

    • session.created
    • session.updated
    • response.created / response.done (a cancelled response also finishes with response.done, carrying a cancelled status)
    • conversation.item.created / conversation.item.deleted / conversation.item.truncated
    • input_audio_buffer.committed / input_audio_buffer.cleared
    • error

Each event has typed fields (e.g. event.item, event.response_id, etc.).
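
For example, the conversation.item methods from item 3 might be used like this (a sketch; the item ID is hypothetical and would come from an earlier conversation.item.created event):

# Cut off an assistant audio message the user talked over,
# keeping only the first 1.5 seconds of its audio content
connection.conversation.item.truncate(
    item_id="item_abc123",   # hypothetical ID from a conversation.item.created event
    content_index=0,
    audio_end_ms=1500,
)

# Remove an item from the conversation history entirely
connection.conversation.item.delete(item_id="item_abc123")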


Asynchronous Usage

If your application is async (e.g., using asyncio), you can do:

import asyncio
import base64

import openai

async def main():
    client = openai.AsyncOpenAI(api_key="YOUR_API_KEY")

    session = await client.beta.realtime.sessions.create(
        model="gpt-4o-realtime-preview",
        modalities=["audio"],
        # ...
    )

    # This yields an AsyncRealtimeConnection in an async context
    async with client.beta.realtime.connect(model="gpt-4o-realtime-preview") as connection:
        # send events (audio must be base64-encoded PCM16)
        await connection.input_audio_buffer.append(audio=base64.b64encode(b"...").decode("utf-8"))

        # read events asynchronously in a loop
        async for event in connection:
            if event.type == "conversation.item.created" and event.item.audio:
                # handle audio chunk, etc.
                pass

            if event.type == "response.done":
                break

asyncio.run(main())

All the same resource methods are available in async flavor: connection.session.update(...) becomes await connection.session.update(...), etc.


Handling Function Calls

The Realtime API also supports “tools” (equivalent to function calling in the standard Chat Completions API). You can pass an array of tools in the sessions.create(...) call (or in session.update) to define which functions the assistant can invoke. The model calls a tool by producing a function_call item in the conversation history; you’ll see a conversation.item.created event for it, and the completed call, including its arguments, is part of the response delivered with response.done.
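
For reference, a tools definition passed to sessions.create(...) might look like this (a sketch; get_weather is a hypothetical function, and parameters follows the usual JSON Schema style):

session = client.beta.realtime.sessions.create(
    model="gpt-4o-realtime-preview",
    modalities=["text"],
    tools=[
        {
            "type": "function",
            "name": "get_weather",   # hypothetical tool
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        }
    ],
    tool_choice="auto",
)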

In your server events loop, you might handle it like this:

if event.type == "conversation.item.created" and event.item.type == "function_call":
    function_name = event.item.function_call.name
    # Execute that function, produce a result
    result = run_function_locally(function_name, event.item.function_call.arguments)
    # Then add a conversation item with the tool’s result
    connection.conversation.item.create(item={
        "type": "function_result",
        "function_call": {
            "name": function_name,
            "arguments": str(result),
        }
    })

    # Then if you'd like the model to respond to that, call
    connection.response.create()

Error Handling

If you send an invalid event or invalid data, you will receive an error event from the server. You can catch this in the event loop:

for event in connection:
    if event.type == "error":
        print("Server reported error:", event.error)
        # handle or break
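
The error payload is structured, so you can report more than just its presence. A sketch of inspecting the typed fields (code and param may be None):

for event in connection:
    if event.type == "error":
        err = event.error
        # Typical fields: err.type, err.code, err.message, err.param
        print(f"Server error [{err.type} / {err.code}]: {err.message}")
        break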

Summary and Next Steps

  • Realtime API is a powerful WebSocket-based interface to build advanced, real-time chat or voice experiences.
  • Synchronous and Asynchronous usage is supported:
    • Realtime vs. AsyncRealtime.
  • Session configuration controls how the model processes input and produces output (text/audio, voice style, system instructions, etc.).
  • Event-based design: You push client events to drive the conversation, and you read back server events containing transcripts, audio data, function calls, and more.
  • Voice-to-Voice is possible by sending raw PCM (or G.711) audio in real time and receiving the assistant’s audio output to play back. The server can handle turn detection automatically, or the client can do it manually.

Further reading and best practices would include:

  • Managing session tokens in a secure environment (especially for ephemeral “client_secret” usage in a browser).
  • Properly chunking audio streams so as not to exceed data limits or cause latency issues.
  • Dealing with partially streamed audio outputs (e.g., buffer them for playback).

With this guide, you should be able to initialize sessions, connect over WebSocket, and implement text or voice-based conversation experiences with the new Realtime API.
