Today, I’m excited to share that we have three new audio models in the API. We’ve also updated our Agents SDK to support the new models, making it possible to convert any text-based agent into an audio agent with a few lines of code.
Speech-to-text
You can now use gpt-4o-transcribe and gpt-4o-mini-transcribe in use cases ranging from customer service voice agents to transcribing meeting notes. The new transcribe models outperform Whisper, offering better accuracy and performance. We’ve also added bidirectional streaming, so you can stream audio in and get a stream of text back. The streaming API supports built-in noise cancellation and a new semantic voice activity detector, so you can opt to receive transcriptions only when the user has finished their thought (useful for building voice agents!). Noise cancellation and semantic VAD are also available in the Realtime API. For more, check out our docs.
Text-to-speech
With the new gpt-4o-mini-tts model, you can precisely control the tone, emotion, and speed of generated voices, creating more natural and engaging experiences. Starting with 10 preset voices, you can use prompts to customize speech for specific scenarios. This enables a wide range of use cases, from more empathetic and dynamic customer service voices to expressive narration for creative storytelling experiences. We’ve also built OpenAI.fm, a demo where you can try the new TTS model under our beta terms. You can read the docs to get started.
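As a rough sketch of how prompt-based steering can look in the OpenAI Python SDK (the voice choice, instructions text, and output path here are illustrative assumptions, not from the announcement):

```python
from openai import OpenAI

client = OpenAI()

# Generate speech and steer its delivery with an instructions prompt,
# streaming the audio straight to a file.
with client.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts",
    voice="coral",  # one of the preset voices
    input="Thanks for calling! I'd be happy to help with your order.",
    instructions="Speak in a warm, empathetic customer-service tone.",
) as response:
    response.stream_to_file("greeting.mp3")
```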
Agents SDK updates
You can now add audio capabilities to a text agent by wrapping it with speech-to-text and text-to-speech steps in just a few lines of code. To get started, visit the Agents SDK docs. If you already have a text-based agent, or a voice agent powered by a speech-to-text and text-to-speech pipeline, using the new models with the Agents SDK is the easiest way to get started. If you’re looking to build low-latency speech-to-speech experiences, we recommend building with our speech-to-speech models in the Realtime API.
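As a rough illustration, the sketch below wraps a plain text agent in the Agents SDK’s voice pipeline (assuming the voice extras are installed via pip install "openai-agents[voice]"; the agent instructions and the silent placeholder audio buffer are illustrative assumptions):

```python
import asyncio

import numpy as np
import sounddevice as sd

from agents import Agent
from agents.voice import AudioInput, SingleAgentVoiceWorkflow, VoicePipeline

# A plain text-based agent.
agent = Agent(
    name="Assistant",
    instructions="You are a helpful support agent. Keep your answers brief.",
)

# Wrap it in a voice pipeline: speech-to-text -> agent -> text-to-speech.
pipeline = VoicePipeline(workflow=SingleAgentVoiceWorkflow(agent))

async def main():
    # Placeholder input: three seconds of silence at 24 kHz.
    # In practice this would be audio captured from a microphone.
    buffer = np.zeros(24000 * 3, dtype=np.int16)
    result = await pipeline.run(AudioInput(buffer=buffer))

    # Play the synthesized reply as audio chunks stream back.
    player = sd.OutputStream(samplerate=24000, channels=1, dtype=np.int16)
    player.start()
    async for event in result.stream():
        if event.type == "voice_stream_event_audio":
            player.write(event.data)

if __name__ == "__main__":
    asyncio.run(main())
```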