Join Olivier Godement, Jeff Harris, Iaroslav Tverdoklhib, and Yi Shen as they unveil and demonstrate three novel audio models within the API—two speech-to-text models and one text-to-speech model—alongside an audio integration with the Agents SDK. This integration empowers developers to construct more intelligent and customizable voice agents.
While comments are disabled during the live video, feel free to engage in discussions here.
So now we can provide vocal context to the TTS model.
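For anyone who wants to try it, here is a minimal sketch (assuming a current openai Python package; the voice, sample text, and output filename are just placeholders) of passing that vocal context through the new instructions parameter:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# "instructions" carries the vocal context: tone, pacing, accent, emotion, etc.
with client.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts",
    voice="coral",
    input="Thanks for calling! Let me pull up your order.",
    instructions="Speak warmly and at a relaxed pace, like a friendly support agent.",
) as response:
    response.stream_to_file("greeting.mp3")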
I’m somewhat surprised this wasn’t bundled into an SDK. Maybe an idea for someone to start? It would be nice to abstract this away and have a model create the parameters on the fly. Maybe this wouldn’t work? Would it cause too many inconsistencies in the voice?
I’m not sure what the star icon means at the bottom left corner (didn’t watch), but those TTS models are so unbelievably impressive. Great work!
Curious about this approach where the Python SDK gets first dibs on everything. Last week it was Traces, this week it’s audio in Responses, which I was surprised to just find out is NOT available unless you’re using that SDK.
No offense, but I don’t want to build an app that is reliant on a kit that may or may not be updated. As a developer I’d just like access to the APIs so I can go on building what I know works for me.
Finally, we now know the purpose behind openai.fm. It’s not a music generation model, but rather a tool designed to test a TTS model capable of processing specific instructions regarding style. This has been frequently requested in the past.
I have been particularly eager for improved TTS and STT from OpenAI, because the costs are usually more reasonable than the competition’s and keep getting lower over time, which makes them accessible to more people (considering price x privacy policies x scalability).
This opens up so many possibilities; now with web search and GPT-4.5 we can experiment with all sorts of interactive experiences.
This has been an incredible year of deliveries from OpenAI, and we are still in March. This is going to be a very interesting year.
OpenAI has a different platform for developer engagement than the one infested by muskrats. Right here.
The “promotion” angle instead of the “communication” angle, and a unidirectional flow instead of engagement (which this forum facilitates), is disappointing, as I will never open an “X” account, which is now required even to see posts ordered by latest (nor do I need a gadget).
I thought there was an unusual quality in the Santa voice that nothing else possessed. I wonder if it came from the kind of gpt-4o audio tuning presented today; Santa is also robustly available via the ChatGPT TTS “speak aloud” button, in contrast to the bland quality of the others.
The employed APIs are not exclusive to just that code. The offered code is just an accelerator, although for now it is also a replacement for documentation (tracing, etc). Think of it as an open-source API “app” you can’t fork. Copy its methods, reproduce its event handlers, lift its ideas, and apply them anywhere, on code you maintain.
I got the impression from the presentation today that these new audio functions were available with “just a few lines of code” in the Agents SDK, which I assumed to be Responses API only.
But then I found out in the docs that audio is not available in the Responses API. Am I incorrect about this?
Also, I get that you could use the SDK as a sort of documentation, but I also got the impression from a conversation with one of the OAI devs on X that Traces was ONLY available through the SDK.
I heard there is one and I want to know what it is about.
I’m having problems trying to give instructions like this example with Python:
from pathlib import Path
from openai import OpenAI

client = OpenAI()
speech_file_path = Path(__file__).parent / "speech.mp3"

response = client.audio.speech.create(
    model="gpt-4o-mini-tts",
    voice="coral",
    input="Today is not a wonderful day to build something people love!",
    instructions="Speak in a cheerful and positive tone.",
)
response.stream_to_file(speech_file_path)
I receive this error: TypeError: Speech.create() got an unexpected keyword argument 'instructions'
We are in the process of testing GPT-4o-mini-tts using various instructions like those demonstrated at openai.fm. While the model performs well with relatively small text input, there are serious issues with larger text input, as shown here: GPT-4o-mini-tts Issues: Volume Fluctuations, Silence, Repetition, Distortion. There ARE use cases for larger text input, and I’m hoping that someone from OpenAI will acknowledge these issues.
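Until that is addressed, a purely illustrative workaround sketch (assuming paragraph-sized chunks stay short enough to avoid the artifacts above; the splitting rule, voice, and file naming are arbitrary) is to synthesize long text in pieces:

from openai import OpenAI

client = OpenAI()

def synthesize_long_text(text: str, out_prefix: str = "part") -> None:
    # Split on blank lines so each request carries only a paragraph or two;
    # the exact length at which the model starts to misbehave is an assumption.
    chunks = [c.strip() for c in text.split("\n\n") if c.strip()]
    for i, chunk in enumerate(chunks):
        with client.audio.speech.with_streaming_response.create(
            model="gpt-4o-mini-tts",
            voice="coral",
            input=chunk,
            instructions="Read in a calm, even narration style.",
        ) as response:
            response.stream_to_file(f"{out_prefix}_{i:03d}.mp3")

The resulting files still need to be concatenated afterwards (for example with ffmpeg), and instruction adherence may drift between chunks, so this is a stopgap rather than a fix.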