Audio Models in the API - live stream at 10 AM PT

Join Olivier Godement, Jeff Harris, Iaroslav Tverdoklhib, and Yi Shen as they unveil and demonstrate three novel audio models within the API—two speech-to-text models and one text-to-speech model—alongside an audio integration with the Agents SDK. This integration empowers developers to construct more intelligent and customizable voice agents.

While comments are disabled during the live video, feel free to engage in discussions here.

5 Likes

Interesting.

So now we can provide vocal context to the TTS model

I’m somewhat surprised this wasn’t bundled into an SDK. Maybe an idea for someone to start? It would be nice to abstract this away and have a model create the parameters on the fly. Or maybe this wouldn’t work? Would it cause too many inconsistencies in the voice?
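Something along these lines might work as a starting point. Here is a minimal sketch, assuming the new instructions parameter, where a chat model writes the voice directions on the fly; the helper name and the prompt are my own invention:

from openai import OpenAI

client = OpenAI()

def speak_with_generated_directions(text: str, context: str) -> bytes:
    # Hypothetical helper: have a chat model draft the voice-acting
    # directions, then pass them to the TTS model as instructions.
    directions = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": "Write one short sentence of voice-acting "
                       f"directions for reading a line aloud in this context: {context}",
        }],
    ).choices[0].message.content

    speech = client.audio.speech.create(
        model="gpt-4o-mini-tts",
        voice="coral",
        input=text,
        instructions=directions,
    )
    return speech.content  # raw audio bytes (MP3 by default)

Whether directions generated this way stay consistent from call to call is exactly the open question.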

I’m not sure what the star icon means at the bottom left corner (didn’t watch), but those TTS models are so unbelievably impressive. Great work!

2 Likes

Curious about this approach where a Python SDK gets first dibs on everything. Last week it was Traces, this week it’s audio in Responses, which I was surprised to just find out is NOT available…unless you’re using that SDK.

No offense, but I don’t want to build an app that is reliant on a kit that may or may not be updated. As a developer I’d just like access to the APIs so I can go on building what I know works for me.

3 Likes

Finally, we now know the purpose behind openai.fm. It’s not a music generation model, but rather a tool designed to test a TTS model capable of following specific style instructions. This has been frequently requested in the past.

2 Likes

Here’s a summary of what was announced in today’s livestream:

https://twitter.com/OpenAIDevs/status/1902773579323674710

OpenAI is also hosting a contest on OpenAI.fm:

https://x.com/OpenAIDevs/status/1902773659497885936

Newsroom release:

https://openai.com/index/introducing-our-next-generation-audio-models/

2 Likes

Wow I’m very excited to try these out!

I have been particularly eager for improved TTS and STT from OpenAI, because the costs are usually more reasonable than the competition’s and keep getting lower in the long run, which makes these models accessible to more people (considering price, privacy policies, and scalability).

This opens up so many possibilities; with web search and GPT-4.5 we can now experiment with all sorts of interactive experiences.

This has been an incredible year of deliveries from OpenAI and we are still in March. This is going to be a very interesting year.

Thanks OpenAI team, keep up the exceptional work!

3 Likes

OpenAI has a different platform for developer engagement than the one infested by muskrats. Right here.

The “promotion” angle instead of the “communication” angle, and the unidirectional flow instead of engagement (which this forum facilitates), is disappointing; I will never open an “X” account, which is required even to see posts ordered latest-first (nor do I need a gadget).


I thought there was an unusual quality in the Santa voice not possessed by anything else. I wonder if that came from the kind of gpt-4o audio tuning presented today; Santa is also robustly available via the ChatGPT TTS “speak aloud” button, in contrast to the bland quality of the others.

The APIs it uses are not exclusive to that code. The offered code is just an accelerator, although for now it is also a replacement for documentation (tracing, etc.). Think of it as an open-source API “app” you can’t fork: copy its methods, reproduce its event handlers, lift its ideas, and apply them anywhere, in code you maintain.
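For instance, the new transcription endpoint is plain HTTPS; here is a minimal sketch with no SDK at all (the file name is a placeholder):

import os

import requests

# Same endpoint the SDK wraps: a multipart upload of the audio file.
resp = requests.post(
    "https://api.openai.com/v1/audio/transcriptions",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    files={"file": open("meeting.mp3", "rb")},
    data={"model": "gpt-4o-transcribe"},
)
print(resp.json()["text"])

The voice features in the Agents SDK ultimately reduce to calls like this one plus the speech endpoint.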

3 Likes

I got the impression from the presentation today that these new audio functions were available with “just a few lines of code” in the Agents SDK (which I assumed to be Responses API only).

But then I find out in the docs that audio is not available in the Responses API. Am I incorrect in this?

Also, I get that you could use the SDK as a sort of documentation, but I also got the impression from a conversation with one of the OAI devs on X that Traces was ONLY available through the SDK.

Happy to be wrong here.

1 Like

There is an example in the cookbook where the new voice models are integrated into a pipeline with the agents.

I believe that’s what you are referring to.
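If it’s the example I’m thinking of, it boils down to roughly this. I’m reconstructing it from memory of the voice quickstart, so treat the exact class and event names as approximate:

import asyncio

import numpy as np
from agents import Agent
from agents.voice import AudioInput, SingleAgentVoiceWorkflow, VoicePipeline

agent = Agent(name="Assistant", instructions="You are a helpful assistant.")
pipeline = VoicePipeline(workflow=SingleAgentVoiceWorkflow(agent))

async def main():
    # Three seconds of silence as stand-in input audio (24 kHz mono).
    audio = AudioInput(buffer=np.zeros(24000 * 3, dtype=np.int16))
    result = await pipeline.run(audio)
    async for event in result.stream():
        if event.type == "voice_stream_event_audio":
            pass  # play event.data on your output device here

asyncio.run(main())

The pipeline handles speech-to-text, the agent run, and text-to-speech in one pass, which is presumably the “just a few lines of code” from the presentation.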

1 Like

I heard there is one and I want to know what it is about.

It seems like .ogg was removed from the list of audio file types that can be uploaded to the new gpt-4o-transcribe. Is that on purpose or just an omission?
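In the meantime, re-encoding to a still-listed format goes through fine with the standard Python client (the file path is a placeholder):

from openai import OpenAI

client = OpenAI()

with open("clip.mp3", "rb") as f:  # same clip, re-encoded to .mp3
    transcript = client.audio.transcriptions.create(
        model="gpt-4o-transcribe",
        file=f,
    )
print(transcript.text)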

I’m having problems trying to give instructions like this example with Python:

from pathlib import Path
from openai import OpenAI

client = OpenAI()
speech_file_path = Path(__file__).parent / "speech.mp3"

response = client.audio.speech.create(
    model="gpt-4o-mini-tts",
    voice="coral",
    input="Today is not a wonderful day to build something people love!",
    instructions="Speak in a cheerful and positive tone.",
)
response.stream_to_file(speech_file_path)

I receive this error: TypeError: Speech.create() got an unexpected keyword argument 'instructions'

Do I need to update some library?

Welcome to the dev forum @jaromerohass

Yes, you’d need to upgrade to the latest version of the openai Python package.

You can do so using:

pip install --upgrade openai
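You can then confirm the upgrade took effect:

python -c "import openai; print(openai.__version__)"

The instructions parameter only exists in recent releases of the package, so an older pinned version raises exactly that TypeError.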
2 Likes

Thank you very much!! I was struggling with something so simple!

2 Likes

We are in the process of testing GPT-4o-mini-tts using various instructions like those demonstrated at openai.fm. While the model performs well with relatively small text input, there are serious issues with larger text input, as shown here: GPT-4o-mini-tts Issues: Volume Fluctuations, Silence, Repetition, Distortion

There ARE use cases for larger text input. I’m hoping that someone from OpenAI will acknowledge these issues.
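Until that happens, the workaround we have been experimenting with is splitting long input at sentence boundaries and synthesizing each chunk separately. A rough sketch; the 1,500-character cap is an arbitrary guess on our part, not a documented limit:

from openai import OpenAI

client = OpenAI()

def synthesize_long(text: str, max_chars: int = 1500) -> list[str]:
    # Greedily pack whole sentences into chunks no longer than max_chars.
    chunks, current = [], ""
    for sentence in text.replace("\n", " ").split(". "):
        if current and len(current) + len(sentence) > max_chars:
            chunks.append(current)
            current = ""
        current += sentence + ". "
    if current:
        chunks.append(current)

    paths = []
    for i, chunk in enumerate(chunks):
        speech = client.audio.speech.create(
            model="gpt-4o-mini-tts",
            voice="coral",
            input=chunk,
            instructions="Speak in a calm, even tone.",
        )
        path = f"part_{i:03d}.mp3"
        with open(path, "wb") as out:
            out.write(speech.content)
        paths.append(path)
    return paths  # stitch the parts together with ffmpeg or similar

This avoids the long-input failure modes at the cost of losing prosody across chunk boundaries.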