Today, I’m excited to share that we have three new audio models in the API. We’ve also updated our Agents SDK to support the new models, making it possible to convert any text-based agent into an audio agent with a few lines of code.
Speech-to-text
You can now use gpt-4o-transcribe and gpt-4o-mini-transcribe in use cases ranging from customer service voice agents to transcribing meeting notes. The new transcribe models outperform Whisper, offering better accuracy and performance. We’ve also added bidirectional streaming, so you can stream audio in and get a stream of text back. The streaming API supports built-in noise cancellation and a new semantic voice activity detector, so you can opt to receive transcriptions only when the user has finished their thought (useful for building voice agents!). Noise cancellation and semantic VAD are also available in the Realtime API. For more, check out our docs.
Text-to-speech
With the new gpt-4o-mini-tts model, you can precisely control the tone, emotion, and speed of generated voices, creating more natural and engaging experiences. Starting with 10 preset voices, you can use prompts to customize speech for specific scenarios. This enables a wide range of use cases, from more empathetic and dynamic customer service voices to expressive narration for creative storytelling experiences. We’ve also built OpenAI.fm, a demo where you can try the new TTS model under our beta terms. You can read the docs to get started.
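As a rough sketch of how prompt-based steering can look in the OpenAI Python SDK (the voice choice, instructions text, and output path here are illustrative assumptions, not from the announcement):

```python
from openai import OpenAI

client = OpenAI()

# Generate speech and steer its delivery with an instructions prompt,
# streaming the audio straight to a file.
with client.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts",
    voice="coral",  # one of the preset voices
    input="Thanks for calling! I'd be happy to help with your order.",
    instructions="Speak in a warm, empathetic customer-service tone.",
) as response:
    response.stream_to_file("greeting.mp3")
```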
Agents SDK updates
You can now add audio capabilities to a text agent by wrapping it with speech-to-text and text-to-speech steps in just a few lines of code. To get started, visit the Agents SDK docs. If you already have a text-based agent, or a voice agent powered by a speech-to-text and text-to-speech pipeline, using the new models with the Agents SDK is the easiest way to get started. If you’re looking to build low-latency speech-to-speech experiences, we recommend building with our speech-to-speech models in the Realtime API.
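As a rough illustration, the sketch below wraps a plain text agent in the Agents SDK’s voice pipeline (assuming the voice extras are installed via pip install "openai-agents[voice]"; the agent instructions and the silent placeholder audio buffer are illustrative assumptions):

```python
import asyncio

import numpy as np
import sounddevice as sd

from agents import Agent
from agents.voice import AudioInput, SingleAgentVoiceWorkflow, VoicePipeline

# A plain text-based agent.
agent = Agent(
    name="Assistant",
    instructions="You are a helpful support agent. Keep your answers brief.",
)

# Wrap it in a voice pipeline: speech-to-text -> agent -> text-to-speech.
pipeline = VoicePipeline(workflow=SingleAgentVoiceWorkflow(agent))

async def main():
    # Placeholder input: three seconds of silence at 24 kHz.
    # In practice this would be audio captured from a microphone.
    buffer = np.zeros(24000 * 3, dtype=np.int16)
    result = await pipeline.run(AudioInput(buffer=buffer))

    # Play the synthesized reply as audio chunks stream back.
    player = sd.OutputStream(samplerate=24000, channels=1, dtype=np.int16)
    player.start()
    async for event in result.stream():
        if event.type == "voice_stream_event_audio":
            player.write(event.data)

if __name__ == "__main__":
    asyncio.run(main())
```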