The "sorry" state of the Realtime SDK

At the moment, the SDK documentation and the overall change management around the SDK are not what I would expect from a top tech company like OpenAI. 1) The endpoints and the schemas for most of the server and client events seem to have changed, sometimes drastically and sometimes subtly. 2) The SDK documentation is a mishmash of the old and new schemas, and they conflict with each other. Check out session.updated, for example (https://platform.openai.com/docs/api-reference/realtime_server_events/session/updated): the JSON sample on the right says something different from the field descriptions next to it (which still reflect the older version).

It seems like, without any kind of change management (migration docs, or even basic versioning of the documentation), they haphazardly introduced some new endpoints with new schemas, updated some parts of the reference docs to match, and left the other parts as they were. The Voice Agents SDK is in much worse shape and I have already given up on it, as it is an unnecessary and very leaky abstraction over the lower-level Realtime SDK.

If anyone from OpenAI is reading these forums, can you please be a bit more diligent about managing your SDK releases? It is an utter mess.

4 Likes

A bit more explanation here:

If you are using the pre-GA (General Availability) versions of the SDK, the endpoint to start a session with your API key and get an ephemeral API token was /v1/realtime/sessions, and then to create the SDP session (WebRTC) you would call /v1/realtime?model=MODELNAME. This still works, also if your model name is gpt-realtime (the GA model), and keeps responding with the now-undocumented (or, as I mentioned, half-documented) session schemas. So if you just update the model name, things keep working. But if you then want to add a new feature, say handle some other event, and go to the SDK docs, you will find entirely new documentation (with conflicts).
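
To make this concrete, here is a minimal sketch of that pre-GA flow. The token-minting step belongs on a server and the WebRTC part in the browser; the model name, voice, and the client_secret.value field are my assumptions from memory of the beta docs, so treat them as such:

// Hedged sketch of the pre-GA WebRTC handshake.
// 1) Mint an ephemeral token with your server-side API key.
const sessionRes = await fetch("https://api.openai.com/v1/realtime/sessions", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.OPENAI_API_KEY}`, // server-side key
    "Content-Type": "application/json",
  },
  body: JSON.stringify({ model: "gpt-realtime", voice: "verse" }), // placeholder values
});
const { client_secret } = await sessionRes.json(); // short-lived ephemeral token

// 2) In the browser: create an offer and POST the SDP to the old endpoint.
const pc = new RTCPeerConnection();
pc.createDataChannel("oai-events"); // channel for client/server events
const offer = await pc.createOffer();
await pc.setLocalDescription(offer);

const sdpRes = await fetch("https://api.openai.com/v1/realtime?model=gpt-realtime", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${client_secret.value}`,
    "Content-Type": "application/sdp",
  },
  body: offer.sdp,
});
await pc.setRemoteDescription({ type: "answer", sdp: await sdpRes.text() });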

In the “new” (GA) docs, the endpoint has changed to /v1/realtime/client_secrets, which BTW is really bad naming for a GA release; I guess they did it instead of creating a proper /v2/realtime/sessions endpoint. And then to create the WebRTC session you call /v1/realtime/calls?model=MODELNAME. Again, instead of creating something like /v2/realtime, they added another oddly named “calls” endpoint. It seems like creating these haphazard new endpoints solved the problem of keeping the existing versions running while introducing the new schemas for responses, sessions, etc.
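
For comparison, here is roughly what the GA flow looks like as I understand it. This is a sketch, not copied from the docs; the request body shape for client_secrets and the response field name (value) are my assumptions:

// Hedged sketch of the GA WebRTC handshake.
// 1) Mint an ephemeral client secret (replaces /v1/realtime/sessions).
const secretRes = await fetch("https://api.openai.com/v1/realtime/client_secrets", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    session: { type: "realtime", model: "gpt-realtime" },
  }),
});
const { value: ephemeralKey } = await secretRes.json(); // assumed response field

// 2) POST the SDP offer to the new /calls endpoint instead of /v1/realtime.
const pc = new RTCPeerConnection();
pc.createDataChannel("oai-events");
const offer = await pc.createOffer();
await pc.setLocalDescription(offer);

const sdpRes = await fetch("https://api.openai.com/v1/realtime/calls?model=gpt-realtime", {
  method: "POST",
  headers: {
    Authorization: `Bearer ${ephemeralKey}`,
    "Content-Type": "application/sdp",
  },
  body: offer.sdp,
});
await pc.setRemoteDescription({ type: "answer", sdp: await sdpRes.text() });
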
This is really not a way to properly version and document things, and I am honestly surprised. It also, ironically, makes it a mess to use something like Codex, because the AI mixes up these two ways of doing things and creates another spaghetti mess.
Anyway, I hope OpenAI will start doing proper versioning when they introduce updates to their SDKs. This really is not the way to go.

1 Like

FYI, /v1/realtime/calls was introduced intentionally, as its function is distinct from /v1/realtime: it creates an actual call rather than just a realtime session.

2 Likes

Thanks for your quick response, Justin. I see that /v1/realtime is still used for WebSocket-based connections, so that was my bad; I was mainly looking at WebRTC. However, IMHO these are still important to fix in the SDK:

  • Have versioned documentation so that one can look at the older versions (if they still work) before migrating to the new SDK.
  • I think the changes in the GA version are big enough that they would have warranted a new /v2/... endpoint. (I get that GA is the true v1, but the beta has been out for ~9 months, which is a long time in such a fast-moving field.)
  • There is a lot of inconsistent, missing, or wrong info in the SDK docs (reference pages where the descriptions differ from the JSON and from the actual output). I currently have to rely on logging and trial and error to make sure things work.
  • The changes should be communicated somewhere, at the very least as breaking changes. You can argue that they are not breaking because the old endpoints and schemas still work, but their documentation is gone.

There is also some stuff that is not quite working (or maybe not intended):

  • output_modalities currently does not accept both “text” and “audio”. The SDK docs say it should, but when I do a session.update (or try it at session creation) I get an error back; see the sketch right after this list.
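
A sketch of the session.update that fails for me, sent over the events channel (WebRTC data channel or WebSocket). The dataChannel variable is assumed to be an already-open channel; only output_modalities matters here:

declare const dataChannel: RTCDataChannel; // assumed: already-connected events channel

dataChannel.send(
  JSON.stringify({
    type: "session.update",
    session: {
      output_modalities: ["audio", "text"], // this combination is rejected; ["audio"] alone works
    },
  })
);
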
2 Likes

Yes, there are still some things in the docs that we’re working to improve. If you can list a few things that you’ve run into, we can make sure they get addressed.

5 Likes

Great to hear! Here is my current list:

  • The session documentation is wrong in the API Reference. The text descriptions are the obsolete ones, whereas the JSON sample “seems to be” the current one; they conflict with each other side by side. E.g. (see the sketch after this list):
    • input_audio_transcription (old) vs. {audio: {input: {transcription: …}}} (new, and correct in the JSON sample on the right)
    • modalities vs. output_modalities, plus the “bug?” that putting both “audio” and “text” in the array currently returns an error. The SDK docs here say it is still valid: https://platform.openai.com/docs/guides/realtime-conversations#session-lifecycle-events
    • input_audio_format (old) vs. {input: {format: {type, rate}}}. Also, the type values are not documented beyond audio/pcm; g711_ulaw etc. probably do not work, or should it be audio/g711_ulaw?
    • Anyway, there are more, but if you look at the session docs you will see that a lot of fields contradict each other.
  • conversation.item.created (and possibly other related events): these are no longer fired when the input audio buffer is committed. That is probably not reliable to use anyway, because it depends on server_vad being the mode, and it is unclear what the behavior is for semantic_vad.
  • General clarification on the old endpoints/schemas (for WebRTC) vs. the new ones, and how long the old ones will stay active.
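
To make the contrast concrete, here is roughly how the same session config looks in the two schemas, as far as I can reconstruct it from the JSON samples. The exact nesting and the example values (transcription model, sample rate) are my best guesses, not taken from a single authoritative doc:

// Old (beta) shape, as still described in parts of the reference text:
const betaSession = {
  modalities: ["audio", "text"],
  input_audio_format: "pcm16",
  input_audio_transcription: { model: "whisper-1" },
};

// New (GA) shape, as suggested by the JSON samples on the right-hand side:
const gaSession = {
  output_modalities: ["audio"],
  audio: {
    input: {
      format: { type: "audio/pcm", rate: 24000 },
      transcription: { model: "whisper-1" },
    },
  },
};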

Voice Agent SDK:

I am not using the Voice Agent SDK anymore, so these observations are from 1-2 weeks back and may already be fixed:

  • The JavaScript SDK was using the correct /v1/realtime/calls endpoint, but at the time the rollout had somehow not completed, so I had to write an interceptor to add the rollout header “OpenAI-Beta: realtime=v1”, because there was no way to do that through the API (sketched after this list). Figuring that out took half a day. It would be great to communicate rollout status, or perhaps not release a new version of an SDK before the rollout is complete. (Codex CLI eventually figured this out for me; Claude Code just could not.)
  • The “usage” data was not returning detailed usage information (i.e. tokens vs. cached tokens vs. audio tokens, etc.), so it was hard to keep track of it.
  • Also, in the Voice Agent SDK, the history was not really getting updated in the emitted events, so I had to go back to the transport level (hence the leaky abstraction) to get transcriptions.
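
For reference, the interceptor mentioned in the first bullet was essentially a fetch wrapper along these lines. This is a sketch; only the header name and value come from my actual workaround, the URL check and everything else are illustrative:

// Wrap fetch so every request to the Realtime endpoints carries the beta
// rollout header the SDK did not let me set.
const originalFetch = globalThis.fetch;

globalThis.fetch = async (input: RequestInfo | URL, init: RequestInit = {}) => {
  const url =
    typeof input === "string" ? input : input instanceof URL ? input.href : input.url;
  if (url.includes("api.openai.com/v1/realtime")) {
    const headers = new Headers(init.headers);
    headers.set("OpenAI-Beta", "realtime=v1");
    init = { ...init, headers };
  }
  return originalFetch(input, init);
};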

And finally, one more suggestion: it would be great to add something like the “Copy Page” feature you have in the guides (which copies the page as markdown for LLMs) to the API Reference docs. Not the full page, of course, but section by section. You can copy the JSON, but it does not have all the info (type definitions, all possible attributes, etc.).

1 Like

I have a few as well.

I’m using the Realtime JavaScript SDK per your examples.

import {RealtimeAgent, RealtimeSession} from '@openai/agents-realtime';

  // ... stuff not worth pasting here ....

  // relevant parts....
  const agent = new RealtimeAgent({
    name: 'realtime',
    instructions: instructions.value,
    prompt: {
      promptId: 'pmpt_68bf460f699c8195a76e1a48a01473620fce5dc25a0963aa',
      version: '10',
      variables: {
        patientname: 'John Smithj'
      }
    } as any
  })

  session = new RealtimeSession(agent, {
    model: 'gpt-realtime',
  })
  await session.connect({apiKey: apiKey.value})

  1. The promptId is ignored. It wasn’t until I turned on error logging that I found what I think is the problem: I’m using your online prompt editor, and gpt-realtime is not an option there, so this call silently fails because gpt-5 does not match gpt-realtime.
  2. I get an error that patientname is not an object.
  3. It is crazy hard to get to the data channel. I want to send events and it is not clear at ALL how to do that. Looking over your source, the data channel I would use is not exposed.

Basically, it seems impossible to use the existing higher-level realtime API for anything real.

I should add that this helped me a lot:

session.on('transport_event', (event) => {
  console.log('transport event', event);
})
2 Likes

Yeah, the documentation for SIP calls is complete garbage at this point. We are not told the schema of the incoming realtime.call.incoming webhook or what /calls/{call_id}/accept accepts. /calls/{call_id}/reject will actually return a 404 complaining that no session exists for a call with the given ID, so it seems that instead you have to /accept and then /hangup; the hangup endpoint is completely undocumented except for a single dev’s comment in the thread “How can I programatically end a gpt-realtime SIP call?”. It would also be nice to be told which HTTP methods they are supposed to be called with, beyond the fact that you can figure out POST /accept from the code example (they do seem to accept everything).
Also, accepting a call and connecting to its WebSocket session with everything left at the defaults, then sending a response.create, does not actually make the model say anything, unlike what the example implies.
Also, unlike the code example given, you have to connect to the session with a ?model parameter even though you already specify it in the /accept call; otherwise you get hit with an invalid_request_error as the first message out of the gate.
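
To illustrate, here is roughly what I ended up doing. This is a sketch: the /accept body shape, the WebSocket URL, and the call ID are pieced together from my own logs and trial and error, so treat every name here as an assumption:

import WebSocket from "ws";

const callId = "rtc_example_call_id"; // placeholder; comes from the realtime.call.incoming webhook

// 1) Accept the incoming call with a session config (body shape is my guess).
await fetch(`https://api.openai.com/v1/realtime/calls/${callId}/accept`, {
  method: "POST",
  headers: {
    Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    "Content-Type": "application/json",
  },
  body: JSON.stringify({
    type: "realtime",
    model: "gpt-realtime",
    instructions: "You are a helpful phone agent.", // placeholder instructions
  }),
});

// 2) Attach to the call's session over WebSocket. In my testing, leaving out
//    ?model here produced an invalid_request_error as the first server message.
const ws = new WebSocket(
  `wss://api.openai.com/v1/realtime?call_id=${callId}&model=gpt-realtime`,
  { headers: { Authorization: `Bearer ${process.env.OPENAI_API_KEY}` } }
);
ws.on("open", () => ws.send(JSON.stringify({ type: "response.create" })));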

tl;dr: it would be great if we got full and up-to-date schemas for literally everything.

I figured out how to send events. It was easier than I originally thought:

    session.transport.sendEvent({
      type: 'response.create',
      response: {
        output_modalities: ["audio"], // or ["text","audio"]
        instructions: "hello",
        conversation: "none" 
      }
    })

The prompt stuff still confuses me. It turns out you DON’T use the prompt editor (this is in the docs, but you have to read very carefully); you use an audio prompt. It just feels odd, as those are instructions, not prompts. It feels strange to have these concepts mixed.

Thanks for the detailed feedback. Some of the doc updates were missed in the initial rollout, which has now been addressed.

Please see the updated docs for WebRTC and SIP usage, as well as the update on GA vs Beta formats. Let us know if anything is still unclear.

2 Likes

Great, thanks. Some more feedback, though:

  • The reject endpoint still isn’t clearly explained. Do we pass the sip_code URI of our project that we direct, e.g., Twilio calls to? Is this a REFER by another name, where we can bounce the caller to another number? Why is the call_id not enough by itself? Mind you, I was not at all familiar with the SIP standard before trying the Realtime API; beforehand I was using Twilio’s WebSocket mode and handling raw audio and VAD myself. More developers like me may need an explanation of exactly what the “INVITE response” that OpenAI will send back on /reject is.
  • The ?model query parameter still isn’t pointed out as mandatory in the WSS URL for the call’s session. Has that behavior changed?
  • /accept fails completely silently with an OK code in some cases, I assume due to malformed input. Today I’ve been trying to add an MCP tool that lets the agent hang up on its own initiative when the conversation is done. Here’s the relevant part of my JSON payload:
    // everything above "tools" is good and works...
    // Adding "tools" in this form makes this fail silently and NOT create a session for the call_id
    "tools": [
      {
        "type": "mcp",
        "server_label": "my_server", // this is just a string and probably fine
        "allowed_tools": [
          "hangup" // a simple list of strings is supposedly allowed here
        ],
        "authorization": "dummy_string", // is this strictly required? not passing this has the same result
        "require_approval": "never", // again, "always" | "never" is one of the two valid types here
        "server_description": "blah blah blah",
        // I try to make the call_id implicit in the URL
        // so I don't have to explicitly tell the model to pass it
        // is that not allowed or is my URL otherwise bad?
        "server_url": "https://temporary-domain.ngrok-free.app/mcp/rtc_9106426020e8494b81415ecb44d3304c" 
      }
    ]
    }
    
    The behavior goes like this:
    // accept call goes OK in theory:
    INFO  api::controllers::ai > Accepted call rtc_9106426020e8494b81415ecb44d3304c
    // Debug response body print: response is EMPTY, no error indicated
    Response body:
    // We try to connect over WebSocket with ?call_id and &model:
    INFO  api::session > Attempting to handle session for call rtc_9106426020e8494b81415ecb44d3304c...
    // "lol", says the API, "lmao, 404"; we haven't actually created a session with /accept
    ERROR api::session > Could not connect to OpenAI session for call ID rtc_9106426020e8494b81415ecb44d3304c, error: HTTP error: 404 Not Found
    

Have I just read the schema for the MCP tool definition wrong? Can I only pass this in an actual session.update client event (in which case, why is that not noted, e.g. “similar to session.update EXCEPT these fields”)? Regardless, it would be nice to get a detailed error message when we get the inputs wrong, telling us how they are wrong. Also, regarding the style of the API reference itself, it would be a big help if we were always explicitly told what is optional and what is not; think TypeScript’s | undefined.

sip_code is a SIP response code, e.g. 603 Global Decline, which is returned to the caller (e.g. Twilio) and will cause the call to be ended accordingly.

model is not needed for the side-channel connection, as the model has already been selected by the parameters passed to the accept endpoint.

Regarding the issue you’re hitting with tools, I’ll take a closer look; ideally you’d get an error back from accept here rather than a silent failure. jubertioai/hello-mcp | Val Town might also be helpful:

session.tools = [{
  type: "mcp",
  server_label: "hello-mcp-demo",
  server_url: mcpUrl,
  authorization: "demo-token-12345", // In production, use a secure token
  require_approval: "never",
}];
1 Like

Thanks for your patience so far, and for the explanation. Also, sorry about the model thing: it turned out that for the longest time I was passing callId rather than call_id, which was hitting the regular session-creation endpoint, which does require it. In fact, I realized this just before hitting “send” on a really long message in this thread. Regarding the tools, I decided to debug this by trying to set the tools separately later in a session.update event, and when doing this over WebSocket I get this error:

 ERROR api::session  > Error received from server: APIError
{ type: "invalid_request_error",
code: Some("unknown_parameter"),
message: "Unknown parameter: 'session.tools[0].server_description'.",
param: Some("session.tools[0].server_description"),
event_id: Some("try_to_set_tools")
}

After simply not passing server_description, the model started hitting my MCP endpoint. I haven’t yet gotten around to updating the definitions for MCP server events, but I presume something is now wrong on my side and will check the repo you linked.

1 Like

@juberti Works beautifully now that I’ve implemented my MCP server endpoint properly; however:

The MCP tool call conversation item and response output item actually come with a type of mcp_call, not mcp_tool_call as the docs say.

I might come back with more complaints after I start adding tools that actually take arguments or return outputs, but let’s hope it’s smooth sailing from here.

Great, I’ve submitted some fixes to the docs based on your feedback; they should show up later today.

2 Likes

@juberti I’m following up on this with two main concerns:

  1. Our organization uses ZDR and, as of yesterday, we are unable to save system prompts from the Dashboard. The save button is now disabled and displays ‘Save is disabled for ZDR orgs.’ Was this change made intentionally? We had previously been able to save prompts for months (with the Responses API since at least June, and gpt-realtime for about two weeks). This is causing significant disruption to our workflow.

  2. (SOLVED) I am working with the Python Agent SDK and encountered an error when using modalities: [‘audio’, ‘text’] with the gpt-realtime model. The error message is as follows:

message=“Invalid modalities: [‘text’, ‘audio’]. Supported combinations are: [‘text’] and [‘audio’].”

This appears to originate from the API rather than the SDK itself. Could you confirm if this is expected, or is it a bug?

Thank you for your help!


Not him, but regarding number 2: it’s actually documented that you can only pass one of those at a time.
You still get a text transcript with the response if you just pass [“audio”]; see the sketch below.
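
A sketch of where I’d look for that transcript, in TypeScript over a raw WebSocket (the original poster is on the Python SDK, but the event payload is the same). With only ["audio"] requested, the audio output part of the response still carries a transcript field; the event and field names here are what I have observed and should be double-checked against the current docs:

import WebSocket from "ws";

const ws = new WebSocket("wss://api.openai.com/v1/realtime?model=gpt-realtime", {
  headers: { Authorization: `Bearer ${process.env.OPENAI_API_KEY}` },
});

ws.on("message", (raw) => {
  const event = JSON.parse(raw.toString());
  if (event.type === "response.done") {
    // Walk the finished response and pick up any transcript fields on the output parts.
    for (const item of event.response?.output ?? []) {
      for (const part of item.content ?? []) {
        if (typeof part.transcript === "string") {
          console.log("assistant transcript:", part.transcript);
        }
      }
    }
  }
});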

2 Likes

Thank you, Jan. You are absolutely right. I had tested the ‘audio’ modality but was not parsing the JSON correctly to extract the transcript property. I can see it working now!

1 Like

Does the issue go away if you don’t specify any modalities?

1 Like

@juberti Yes, it does! I am all set, thank you.