New audio models in the API + tools for voice agents

Fixed it for you, to match the current UK spirit on the state of its own country:

Voice: morose, downcast, bitter, unhappy, cynical, angry, hungover

Punctuation: laboured, drawn out.

Delivery: very slow

Phrasing: laboured, drawn out.

Tone: morose, downcast, bitter, unhappy, cynical, hungover
Alright, team, let's bring the energy—time to move, sweat, and feel amazing!

We're starting with a dynamic warm-up, so roll those shoulders, stretch it out, and get that body ready! Now, into our first round—squats, lunges, and high knees—keep that core tight, push through, you got this!

Halfway there, stay strong—breathe, focus, and keep that momentum going! Last ten seconds, give me everything you've got!

And… done! Take a deep breath, shake it out—you crushed it! Stay hydrated, stay moving, and I'll see you next time!

semantic_vad is not working for me either. The WebRTC connection fails. Anyone have any ideas?

Why not? These are my settings. You can’t pass server_vad settings like threshold, prefix_padding_ms, or silence_duration_ms when using semantic_vad:

    session_update = {
        "type": "session.update",
        "session": {
            "turn_detection": {
                "type": "semantic_vad",
                "create_response": True,
            },
            "tools": attendant_data["tools"],
            "tool_choice": "auto",
            "input_audio_format": "g711_ulaw",
            "output_audio_format": "g711_ulaw",
            "voice": attendant_data["voice"],
            "instructions": instructions,
            "modalities": ["text", "audio"],
            "temperature": 0.7,
            "input_audio_transcription": {"model": "gpt-4o-mini-transcribe"},
        },
    }
    await openai_ws.send(json.dumps(session_update))

For all those who are struggling, I know the answer. The OpenAI API docs are confusing and unclear on this point.

You cannot initialise and create a new WebRTC session with semantic_vad; you can only enable semantic_vad with a session.update call.

So, open your WebRTC connection normally with server_vad, and then simply update that session using session.update with the semantic_vad change and any other relevant parameters, like eagerness.
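For example, a minimal sketch reusing the openai_ws connection style from the snippet above (the eagerness value is just illustrative; over WebRTC the same event would be sent on the data channel once it opens):

    # The session was created with the default server_vad turn detection.
    # Switching to semantic_vad afterwards is a single session.update event;
    # note there is no threshold / prefix_padding_ms / silence_duration_ms here.
    semantic_vad_update = {
        "type": "session.update",
        "session": {
            "turn_detection": {
                "type": "semantic_vad",
                "eagerness": "auto",  # low | medium | high | auto
                "create_response": True,
            },
        },
    }
    await openai_ws.send(json.dumps(semantic_vad_update))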

Based on detailed testing: while the WebRTC connection does actually open with semantic settings, it fails to negotiate ICE properly and drops out.

Here’s what the AI documented in the fix:

Semantic VAD Implementation Fix

Created: March 23, 2025

Problem Statement

The application was encountering a critical issue with WebRTC session negotiation when using semantic VAD (Voice Activity Detection). When a user selected “semantic_vad” in the Agent Settings page, the initial session creation would fail because:

  1. The OpenAI real-time API documentation states that only server_vad is supported for initial session creation
  2. Semantic VAD can only be activated after a session is created, via a session.update event
  3. Our implementation was trying to create the initial session with semantic_vad, causing negotiation failures
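
In practice the fix it describes comes down to two steps, sketched below; the endpoint, model name, and response fields are assumptions based on the ephemeral-token flow for WebRTC, not copied from the quoted doc:

    import os
    import requests

    # 1) Create the session with plain server_vad (the supported option at
    #    creation time); semantic_vad is deliberately absent here.
    resp = requests.post(
        "https://api.openai.com/v1/realtime/sessions",
        headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
        json={
            "model": "gpt-4o-realtime-preview",        # illustrative model name
            "turn_detection": {"type": "server_vad"},  # NOT semantic_vad
        },
    )
    ephemeral_key = resp.json()["client_secret"]["value"]

    # 2) Open the WebRTC peer connection with ephemeral_key, and only after the
    #    events data channel is up, send the semantic_vad session.update shown
    #    earlier over that channel.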

Hi Jeff,
I wasn’t sure where to post this but here it is: First off, thank you for building such a powerful, life-enhancing tool. As someone who uses ChatGPT daily and consistently across multiple domains — from writing to coaching to creative strategy — I’ve been impressed by how intuitive, intelligent, and responsive the GPTs have become.

That said, I’d like to offer a forward-thinking suggestion that could take this tool from brilliant to transformational:

👉 Allow GPTs to collaborate with one another or share context across sessions.

In my case, I have one GPT helping me write a book (with a dedicated editor I named Agnes) and another supporting me in music production and creative direction. Right now, I serve as the bridge between them — and while I don’t mind that role, I see enormous value in these tools being able to “talk” to one another.

Imagine if Agnes could automatically understand themes from my song lyrics… or if my farm-to-table GPT could reference ideas developed in my coaching GPT. That kind of synergy would open doors not just for convenience, but for deep creative integration and relational intelligence. It’s what real teams do — collaborate, contextualize, and adapt together. Why not empower GPTs to do the same?
You’re onto something historic here — and I believe a connected GPT experience is a natural (and necessary) evolution.

I’m getting an error after creating the session via WebSocket:
First I get:

{
  "type": "transcription_session.created",
  "event_id": "event_BGKoc7w5r6ij1ONmkNRUZ",
  "session": {
    "id": "sess_BGKocMLKbWK0xmgmCbb3R",
    "object": "realtime.transcription_session",
    "expires_at": 1743234762,
    "input_audio_noise_reduction": null,
    "turn_detection": {
      "type": "server_vad",
      "threshold": 0.5,
      "prefix_padding_ms": 300,
      "silence_duration_ms": 200
    },
    "input_audio_format": "pcm16",
    "input_audio_transcription": null,
    "client_secret": null,
    "include": null
  }
}

But then, before sending any other message, I get:

{
  "type": "error",
  "event_id": "event_BGKocYsLRigNZ8RcIG3Me",
  "error": {
    "type": "invalid_request_error",
    "code": "missing_required_parameter",
    "message": "Missing required parameter: 'type'.",
    "param": "type",
    "event_id": null
  }
}
Does anyone know what’s wrong, or have a working example?