New audio models in the API + tools for voice agents

Today, I’m excited to share that we have three new audio models in the API. We’ve also updated our Agents SDK to support the new models, making it possible to convert any text-based agent into an audio agent with a few lines of code.

Speech-to-text
You can now use gpt-4o-transcribe and gpt-4o-mini-transcribe in use cases ranging from customer service voice agents to transcribing meeting notes. The new transcribe models outperform Whisper, offering better accuracy and performance. We’ve also added bidirectional streaming, so you can stream audio in and get a stream of text back. The streaming API also supports built-in noise cancellation and a new semantic voice activity detector, so you can opt to receive transcriptions only when the user has finished their thought (useful for building voice agents!). Noise cancellation + semantic VAD are also available in the Realtime API. For more, check out our docs.
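
As a minimal sketch (not from the post; the file name is a placeholder, and the streaming event shapes should be double-checked against the API reference), streaming a transcription with the Node SDK looks roughly like this:

import fs from "node:fs";
import OpenAI from "openai";

const openai = new OpenAI();

// Stream text back as the audio is transcribed.
const stream = await openai.audio.transcriptions.create({
  file: fs.createReadStream("meeting.wav"), // placeholder audio file
  model: "gpt-4o-mini-transcribe",
  response_format: "text",
  stream: true,
});

for await (const event of stream) {
  // Print partial transcripts as they arrive.
  if (event.type === "transcript.text.delta") {
    process.stdout.write(event.delta);
  }
}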

Text-to-speech
With the new gpt-4o-mini-tts model, you can precisely control the tone, emotion, and speed of generated voices, creating more natural and engaging experiences. Starting with 10 preset voices, you can use prompts to customize speech for specific scenarios. This enables a wide range of use cases, from more empathetic and dynamic customer service voices to expressive narration for creative storytelling experiences. We’ve also built :radio: OpenAI.fm :radio:, a demo where you can try our new TTS model under our beta terms. You can read the docs to get started.
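
Here's a hedged sketch of steering the new model with the instructions parameter (discussed further down-thread); the voice, prompt text, and file handling are illustrative choices, not from the post:

import fs from "node:fs";
import OpenAI from "openai";

const openai = new OpenAI();

const speech = await openai.audio.speech.create({
  model: "gpt-4o-mini-tts",
  voice: "coral", // one of the preset voices
  input: "Thanks for calling! How can I help you today?",
  // The instructions parameter steers tone, emotion, and pacing.
  instructions: "Speak in a warm, empathetic customer-service tone.",
});

// The response body is the generated audio; write it to disk.
fs.writeFileSync("greeting.mp3", Buffer.from(await speech.arrayBuffer()));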

Agents SDK updates
You can now add audio capabilities to text agents by including speech-to-text and text-to-speech endcaps with just a few lines of code. To get started, visit the Agents SDK docs. If you already have a text-based agent, or a voice agent powered by a speech-to-text and text-to-speech pipeline, using the new models with the Agents SDK is the best way to get started. If you’re looking to build low-latency speech-to-speech experiences, we recommend building with our speech-to-speech models in the Realtime API.
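
The Agents SDK wires these endcaps up for you; purely to illustrate the shape (this uses plain SDK calls, not the Agents SDK API, and the model and voice choices here are mine), a single voice turn is speech-to-text, then your text agent, then text-to-speech:

import fs from "node:fs";
import OpenAI from "openai";

const openai = new OpenAI();

async function voiceTurn(audioPath: string): Promise<Buffer> {
  // 1) Speech-to-text endcap: transcribe the user's audio.
  const { text } = await openai.audio.transcriptions.create({
    file: fs.createReadStream(audioPath),
    model: "gpt-4o-transcribe",
  });

  // 2) Your existing text-based agent (stubbed as one chat completion).
  const chat = await openai.chat.completions.create({
    model: "gpt-4o-mini",
    messages: [{ role: "user", content: text }],
  });

  // 3) Text-to-speech endcap: speak the agent's reply.
  const speech = await openai.audio.speech.create({
    model: "gpt-4o-mini-tts",
    voice: "alloy",
    input: chat.choices[0].message.content ?? "",
  });
  return Buffer.from(await speech.arrayBuffer());
}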

13 Likes

Realtime vision coming soon I hope!

1 Like

This is great! Can those models speak Brazilian Portuguese too?

When will the next upgrade to the Realtime API be released?

1 Like

Not optimized for it, but try prompting it to speak in Brazilian Portuguese! We’ve seen the models do very well in all sorts of languages and accents.

3 Likes

I just tested some of the newly released APIs from this post. They don’t seem to be working.
I found two issues:

  1. Realtime API for transcription-only use cases now returns a 400 bad request with no error messages.
  2. The new semantic_vad turn detection seems to be broken, even for traditional Realtime API sessions.

Here are the details for issue 1:
I first created a transcription_session object:

const response = await fetch(
  "https://api.openai.com/v1/realtime/transcription_sessions",
  {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      input_audio_noise_reduction: {
        type: "far_field",
      },
      input_audio_transcription: {
        language: "en",
        model: "gpt-4o-mini-transcribe",
        prompt: "expect data science and programming words",
      },
      turn_detection: {
        eagerness: "medium", // "low", "medium", "high"
        type: "semantic_vad",
      },
    }),
  },
);

I got back a realtime.transcription_session object:

{
    "id": "sess_BDNVo7DGNBHyv27wXoWNW",
    "object": "realtime.transcription_session",
    "expires_at": 0,
    "input_audio_noise_reduction": null,
    "turn_detection": {
        "type": "semantic_vad",
        "eagerness": "medium"
    },
    "input_audio_format": "pcm16",
    "input_audio_transcription": {
        "model": "gpt-4o-mini-transcribe",
        "language": "en",
        "prompt": "expect data science and programming words"
    },
    "client_secret": {
        "value": "ek_xxxxxxxx",
        "expires_at": 1742535544
    },
    "include": null
}

I use the EPHEMERAL_KEY to establish a WebRTC connection:

const baseUrl = "https://api.openai.com/v1/realtime";
const model = "gpt-4o-mini-transcribe";
const sdpResponse = await fetch(`${baseUrl}?model=${model}`, {
  method: "POST",
  body: offer.sdp,
  headers: {
    Authorization: `Bearer ${EPHEMERAL_KEY}`,
    "Content-Type": "application/sdp",
    "OpenAI-Beta": "realtime=v1",
  },
});

This POST request returns 400 BAD REQUEST with no error message.

Here are the details for issue 2:
I create a traditional Realtime API session with the new semantic_vad:

const response = await fetch(
  "https://api.openai.com/v1/realtime/sessions",
  {
    method: "POST",
    headers: {
      Authorization: `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({
      model: "gpt-4o-realtime-preview-2024-12-17",
      voice: "verse",
      turn_detection: {
        type: "semantic_vad",
        eagerness: "medium",
      },
    }),
  },
);

I get an object back:

{
    "id": "sess_BDNcgumFgNZBdUsYwGkTw",
    "object": "realtime.session",
    "expires_at": 0,
    "input_audio_noise_reduction": null,
    "turn_detection": {
        "type": "semantic_vad",
        "eagerness": "medium",
        "create_response": true,
        "interrupt_response": true
    },
    "input_audio_format": "pcm16",
    "input_audio_transcription": null,
    "client_secret": {
        "value": "ek_xxxxxxxxx",
        "expires_at": 1742535970
    },
    "include": null,
    "model": "gpt-4o-realtime-preview-2024-12-17",
    "modalities": [
        "text",
        "audio"
    ],
    "instructions": "Your knowledge cutoff is 2023-10.",
    "voice": "verse",
    "output_audio_format": "pcm16",
    "tool_choice": "auto",
    "temperature": 0.8,
    "max_response_output_tokens": "inf",
    "tools": []
}

I use the EPHEMERAL_KEY to establish a WebRTC connection:

const baseUrl = "https://api.openai.com/v1/realtime";
const model = "gpt-4o-realtime-preview-2024-12-17";
const sdpResponse = await fetch(`${baseUrl}?model=${model}`, {
  method: "POST",
  body: offer.sdp,
  headers: {
    Authorization: `Bearer ${EPHEMERAL_KEY}`,
    "Content-Type": "application/sdp",
    "OpenAI-Beta": "realtime=v1",
  },
});

The WebRTC connection fails.


If I remove the semantic_vad part when I create the session:

turn_detection: {
  type: "semantic_vad",
  eagerness: "medium",
},

I get a server_vad session back and the WebRTC connection works with no problem.

1 Like

I tried to establish a WebRTC connection using the code below.

  const baseUrl = "https://api.openai.com/v1/realtime/transcription_sessions";
  const model = "gpt-4o-transcribe";
  const sdpResponse = await fetch(`${baseUrl}?model=${model}`, {
    method: "POST",
    body: offer.sdp,
    headers: {
      Authorization: `Bearer ${EPHEMERAL_KEY}`,
      "Content-Type": "application/sdp"
    },
  });

But got this error:

 {
    "message": "Unsupported content type: 'application/sdp'. This API method only accepts 'application/json' requests, but you specified the header 'Content-Type: application/sdp'. Please try again with a supported content type.",
    "type": "invalid_request_error",
    "param": null,
    "code": "unsupported_content_type"
  }
1 Like

As of now, I am getting an error when trying to transcribe using the new models:

import OpenAI, { toFile } from 'openai'

const openai = new OpenAI()

export async function speechToText(blob: Blob): Promise<string> {
  const file = await toFile(blob, null, { type: blob.type }) // audio/wav
  const { text } = await openai.audio.transcriptions.create({
    file,
    // model: 'gpt-4o-mini-transcribe', // throws Error: 400 This model does not support the format you provided.
    model: 'whisper-1', // works
    response_format: 'json',
  })
  return text
}

Any ideas?

Thanks in advance!

2 Likes

This is wonderful news.
:mechanical_arm:

I agree, same issue with me. It isn’t working when I’m using semantic_vad, but if I revert back to server_vad, no issues.

I’m connecting over WebRTC, not that I’m sure it matters. It seems like an API payload issue.

1 Like

I’m experiencing the same issue here, I can’t integrate semantic_vad using WebRTC.

1 Like

Just tested: you can use "input_audio_transcription": {"model": "gpt-4o-transcribe"} and "turn_detection": {"type": "semantic_vad"}, which is GREATLY improved over server_vad and Whisper, especially for Portuguese and phone calls (where audio quality isn’t the best - "g711_ulaw").
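
For anyone wanting to try the same setup, here's a minimal sketch of sending that configuration as a session.update event over an already-open Realtime WebSocket (the ws variable is my assumption; check the exact event shape against the Realtime API reference):

// Assumes `ws` is an open WebSocket connection to the Realtime API.
ws.send(
  JSON.stringify({
    type: "session.update",
    session: {
      input_audio_format: "g711_ulaw", // e.g., for phone-call audio
      input_audio_transcription: { model: "gpt-4o-transcribe" },
      turn_detection: { type: "semantic_vad" },
    },
  }),
);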

Big pricing discrepancy:

OpenAI has this on the developer page:
GPT-4o mini TTS: $0.60 per 1M input tokens.

On the pricing page you have:
gpt-4o-mini-audio-preview: $10 per 1M tokens.

Can you indicate which rate we should be using, please?

https://platform.openai.com/docs/models/gpt-4o-mini-tts

Hi Jerry - we built our realtime agent on WebSocket and are beginning to move to WebRTC, but your post scares me. We use the Realtime API with the Assistants API and all is fine right now.
But we’d like to use the new audio capabilities. Any progress on this yet, or should we stick with WebSocket for now?

General Question: (When) Can we use custom voices with our realtime assistants?

Still not working for me! Working for others?

One question: when you pass instructions along with non-English input text, which language should the instructions be in?

The same language? E.g., both in Spanish?

There is no mention of this in the API docs:
https://platform.openai.com/docs/api-reference/audio/createSpeech#audio-createspeech-instructions

1 Like

OpenAI just updated their documentation. Yesterday it stated that Realtime transcription supported both WebSocket and WebRTC, but now it has been changed to say only WebSocket is supported.

1 Like

gpt-4o-mini-audio-preview is a bidirectional audio language model compatible only with Chat Completions.
Text is $0.15/$0.60 per 1M tokens.
Voice is $10/$20 per 1M tokens.

It is tuned to chat with you.

You would not use the rate of that December 2024 model, which is unrelated to anything in this announcement, when discussing a model that was just released.

It’s getting harder to keep up!

Love the design of the demo site btw :heart_eyes: