Did OpenAI just make a new AI Voice?

So, I saw this video showing an AI voice that sounds remarkable. https://www.youtube.com/watch?v=6Mi1jRszXIc

Last I checked (1-2 months ago), OpenAI’s API voices sound nowhere near this good.

Is this real? I couldn’t find any additional information about it online.

Maybe the voices on ChatGPT are different than what’s accessible using the API?

yea - it’s possible.

ChatGPT had a santa voice last year!

How can I access these new voices via the API? Do you know?

ChatGPT has different text-to-speech voice models than are available on the API.

This may be to clearly distinguish your product from their product.

To save you clicking on a tedious Youtube short, there is just a guy telling ChatGPT mobile to make pauses when it writes responses. A normal voice available in the app is used.

You can also type out umhs and ahhs when you sent text to the API, or even include cues that are often acted on instead of spoken, like [pause], or [angrily].

Then you can instruct the AI that is “speaking” to write like that … if you want to waste more user time with speech that is already plodding and laborious - and then you look and it was only two short paragraphs.


Here’s the above text, that has been “enhanced” by my instructions to GPT-4 to rewrite the TTS input.

I gave it more “like” interjections, like the ones that annoy my mom…

I see. The second one is quite good. Is the second voice available from the API?

Yes, I used the API, and the voices there are “Alloy” and “Nova”, respectively.

https://platform.openai.com/docs/guides/text-to-speech/voice-options

Here’s a system prompt to make GPT-4 into a speech informalizer:

Your job is to enhance the quality of the provided text, which is intended to be spoken by an AI text-to-voice service. You will make the resulting speech sound more natural and human-like, as if a human was thinking while speaking, and being recorded while speaking these words in an impromptu fashion, by including natural stop phrases, noises, “ums” and “ahs” into the text itself.

  • Onomatopoeic sounds that you write will be spoken aloud. They can even be approximated and will be sounded-out.

  • Short “stage directions” in square brackets, such as [pause], or [spoken angrily], [deep breath] can affect the style of speech without being spoken aloud.

  • Important: The additional vocalizations will include a LOT of interjections of the word “like” before almost anything descriptive, like how a young person talks. This use of “like” is almost every sentence, and even more common than an “um”. Other spoken phrases such as “You know”, and similar spoken-word thinking ejaculations, are also used often.

1 Like

Hello,

I don’t fully understand where to use the prompt.
I mean, is it in the code ? And would you be so kind to tell what to do cos I don’t see anything in the documentation about it.

I thank you already.

API AI models use a system message that defines how the AI is supposed to behave.

This prompt above is giving the AI additional instructions to follow when it generates that language, so that it will not just be a string of predictive AI text at a linear pace, it will have more “thinking” pauses and sounds going on that it writes into the language that it outputs, tuned for the text-to-speech engine to recognize, so speech can be more chatty-sounding.

The prompt is written to alter text, but it can also just be direction to the AI.

Giving the instruction above as “AI programming” in a chat completions system message, it transforms my response to this:

[laughs softly] So, like, API AI models? They use this, um, system message that basically, like, defines how the AI is supposed to, you know, behave.

[pause] And this prompt above? It’s giving the AI, like, additional instructions to follow when it, uh, generates that language. So it’s not just gonna be, like, a string of predictive AI text, you know, going at a linear pace. It will have, like, more “thinking” pauses and sounds, um, going on that it writes into the language that it outputs. This is all, like, tuned for the text-to-speech engine to recognize, so, like, speech can be more chatty-sounding.

Oh, and, like, the prompt? It’s written to alter text, but it can also just be, um, direction to the AI. Cool, right?

In ChatGPT, you could write a similar “custom instruction” to change how the AI produces its chat.

(post deleted by author)