New audio models in the API + tools for voice agents

Fixed it for you, to match the current UK spirit on the state of its own country:

Voice: morose, downcast, bitter, unhappy, cynical, angry, hungover

Punctuation: laboured, drawn out.

Delivery: very slow

Phrasing: laboured, drawn out.

Tone: morose, downcast, bitter, unhappy, cynical, hungover
Alright, team, let's bring the energy—time to move, sweat, and feel amazing!

We're starting with a dynamic warm-up, so roll those shoulders, stretch it out, and get that body ready! Now, into our first round—squats, lunges, and high knees—keep that core tight, push through, you got this!

Halfway there, stay strong—breathe, focus, and keep that momentum going! Last ten seconds, give me everything you've got!

And… done! Take a deep breath, shake it out—you crushed it! Stay hydrated, stay moving, and I'll see you next time!

semantic_vad is not working for me either. The WebRTC connection fails. Anyone have any ideas?

Why not? These are my settings. You can’t pass server_vad settings like threshold, prefix_padding_ms, or silence_duration_ms when using semantic_vad:

    session_update = {
        "type": "session.update",
        "session": {
            "turn_detection": {
                "type": "semantic_vad",
                "create_response": True,
            },
            "tools": attendant_data["tools"],
            "tool_choice": "auto",
            "input_audio_format": "g711_ulaw",
            "output_audio_format": "g711_ulaw",
            "voice": attendant_data["voice"],
            "instructions": instructions,
            "modalities": ["text", "audio"],
            "temperature": 0.7,
            "input_audio_transcription": {"model": "gpt-4o-mini-transcribe"},
        },
    }
    await openai_ws.send(json.dumps(session_update))

For all those who are struggling, I know the answer. The OpenAI API docs are confusing and unclear on this point.

You cannot initialise and create a new WebRTC session with semantic_vad; you can only enable semantic_vad with a session.update call.

So, open your WebRTC connection normally with server_vad, and then simply update that session using session.update with the semantic_vad change and any other relevant parameters, like eagerness.
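For example, a minimal sketch reusing the openai_ws connection style from the snippet above (the eagerness value is just illustrative; over WebRTC the same event would be sent on the data channel once it opens):

    # The session was created with the default server_vad turn detection.
    # Switching to semantic_vad afterwards is a single session.update event;
    # note there is no threshold / prefix_padding_ms / silence_duration_ms here.
    semantic_vad_update = {
        "type": "session.update",
        "session": {
            "turn_detection": {
                "type": "semantic_vad",
                "eagerness": "auto",  # low | medium | high | auto
                "create_response": True,
            },
        },
    }
    await openai_ws.send(json.dumps(semantic_vad_update))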

Based on detailed testing: while the WebRTC connection does actually open with semantic settings, it fails to negotiate ICE properly and drops out.

Here’s what the AI documented in the fix:

Semantic VAD Implementation Fix

Created: March 23, 2025

Problem Statement

The application was encountering a critical issue with WebRTC session negotiation when using semantic VAD (Voice Activity Detection). When a user selected “semantic_vad” in the Agent Settings page, the initial session creation would fail because:

  1. The OpenAI real-time API documentation states that only server_vad is supported for initial session creation
  2. Semantic VAD can only be activated after a session is created, via a session.update event
  3. Our implementation was trying to create the initial session with semantic_vad, causing negotiation failures
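
In practice the fix it describes comes down to two steps, sketched below; the endpoint, model name, and response fields are assumptions based on the ephemeral-token flow for WebRTC, not copied from the quoted doc:

    import os
    import requests

    # 1) Create the session with plain server_vad (the supported option at
    #    creation time); semantic_vad is deliberately absent here.
    resp = requests.post(
        "https://api.openai.com/v1/realtime/sessions",
        headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
        json={
            "model": "gpt-4o-realtime-preview",        # illustrative model name
            "turn_detection": {"type": "server_vad"},  # NOT semantic_vad
        },
    )
    ephemeral_key = resp.json()["client_secret"]["value"]

    # 2) Open the WebRTC peer connection with ephemeral_key, and only after the
    #    events data channel is up, send the semantic_vad session.update shown
    #    earlier over that channel.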

Hi Jeff,
I wasn’t sure where to post this but here it is: First off, thank you for building such a powerful, life-enhancing tool. As someone who uses ChatGPT daily and consistently across multiple domains — from writing to coaching to creative strategy — I’ve been impressed by how intuitive, intelligent, and responsive the GPTs have become.

That said, I’d like to offer a forward-thinking suggestion that could take this tool from brilliant to transformational:

👉 Allow GPTs to collaborate with one another or share context across sessions.

In my case, I have one GPT helping me write a book (with a dedicated editor I named Agnes) and another supporting me in music production and creative direction. Right now, I serve as the bridge between them — and while I don’t mind that role, I see enormous value in these tools being able to “talk” to one another.

Imagine if Agnes could automatically understand themes from my song lyrics… or if my farm-to-table GPT could reference ideas developed in my coaching GPT. That kind of synergy would open doors not just for convenience, but for deep creative integration and relational intelligence. It’s what real teams do — collaborate, contextualize, and adapt together. Why not empower GPTs to do the same?
You’re onto something historic here — and I believe a connected GPT experience is a natural (and necessary) evolution.

I’m getting an error after creating the session via WebSocket:
First I get:

{
  "type": "transcription_session.created",
  "event_id": "event_BGKoc7w5r6ij1ONmkNRUZ",
  "session": {
    "id": "sess_BGKocMLKbWK0xmgmCbb3R",
    "object": "realtime.transcription_session",
    "expires_at": 1743234762,
    "input_audio_noise_reduction": null,
    "turn_detection": {
      "type": "server_vad",
      "threshold": 0.5,
      "prefix_padding_ms": 300,
      "silence_duration_ms": 200
    },
    "input_audio_format": "pcm16",
    "input_audio_transcription": null,
    "client_secret": null,
    "include": null
  }
}

But then, before sending any other message, I get:

{
  "type": "error",
  "event_id": "event_BGKocYsLRigNZ8RcIG3Me",
  "error": {
    "type": "invalid_request_error",
    "code": "missing_required_parameter",
    "message": "Missing required parameter: 'type'.",
    "param": "type",
    "event_id": null
  }
}
Does anyone know what’s wrong, or have a working example?