TTS: add emphasis to one word in spoken text

Is it possible to add emphasis to one specific word in the text that I want to be spoken by the TTS (audio/speech) endpoint?

Let’s say I want the following text to be spoken:
“Are you still using this?”
The exact meaning can be very different when the emphasis is on ‘you’ or ‘still’ or ‘this’.

How can I convince TTS to put emphasis on a certain word?




It’s somewhat possible but likely not going to work reliably. From the docs:

There is no direct mechanism to control the emotional output of the audio generated. Certain factors may influence the output audio like capitalization or grammar but our internal tests with these have yielded mixed results.

Okay, thanks mate!
Perhaps it will be added in the future :wink:


Try these:

“Are you still *using* this?”

“Are you still using *this*?”

“Are you *still* using this?”


Interesting …
The input of the audio/speech endpoint is a String.
How do I give italics to that endpoint?

This is the sample code that I use to call the endpoint:

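from pathlib import Path
from openai import OpenAI
client = OpenAI()

speech_file_path = Path(__file__).parent / "speech.mp3"
response = client.audio.speech.create(
  model="tts-1",  # assumed value
  voice="alloy",  # assumed value
  input="Today is a wonderful day to build something people love!"
)

response.stream_to_file(speech_file_path)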


How do I tell a String that a word is italic?

You can try:


and why not:


Or you can get creative and add another sentence to tell the model that it should put emphasis on the word:

‘Jane wanted to tell everybody about today. When she addressed the crowd, she put emphasis on the word “Today” and then she said: “Today is a wonderful day to build something people love!”’

You have to play around with it, especially since

tests … have yielded mixed results.

Here’s a technique: put a short pause right before the word. After the pause the AI has to do a little “reset”, and that sounds like emphasis.

So, are … you still using this?
So, are you … still using this?
So, are you still using … this?

[Audio clip: combined from three different runs]
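
The pause trick can be automated with a small string helper. This is a minimal sketch, assuming a hypothetical function name `add_pause_before` and assuming that the first whole-word match is the one to emphasize. It only builds the input strings and does not call the endpoint:

```python
import re


def add_pause_before(text: str, word: str, pause: str = "…") -> str:
    """Insert a pause marker before the first whole-word match of `word`.

    `add_pause_before` is a hypothetical helper, not part of the OpenAI SDK.
    """
    pattern = re.compile(rf"\b{re.escape(word)}\b", flags=re.IGNORECASE)
    # Keep the matched word's original casing and prepend the pause marker.
    return pattern.sub(lambda m: f"{pause} {m.group(0)}", text, count=1)


sentence = "So, are you still using this?"
variants = [add_pause_before(sentence, w) for w in ("you", "still", "this")]
```

Each entry in `variants` can then be passed as the `input` string. Whether the pause is actually heard as emphasis still varies from run to run.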


When I add <em> tags to the text, I do notice some changes, but I do not know whether they are due to the tags or because the resulting speech is different every time anyway.

And sometimes the voice actually speaks the ‘em’ !!
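
One way to separate the effect of the markup from ordinary run-to-run variation is to render several takes of each variant and listen to them side by side. This is a rough sketch, assuming a hypothetical `render_takes` helper and a `synthesize` callable that you supply yourself. The file naming scheme is also an assumption:

```python
from pathlib import Path
from typing import Callable, Dict, List


def render_takes(
    variants: Dict[str, str],
    synthesize: Callable[[str], bytes],
    out_dir: Path,
    takes: int = 3,
) -> List[Path]:
    """Render each text variant `takes` times so the clips can be compared.

    `synthesize` is a caller-supplied function (an assumption of this sketch)
    that turns input text into audio bytes.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    written: List[Path] = []
    for label, text in variants.items():
        for take in range(1, takes + 1):
            # One file per (variant, take), e.g. "plain_take1.mp3".
            path = out_dir / f"{label}_take{take}.mp3"
            path.write_bytes(synthesize(text))
            written.append(path)
    return written
```

With the OpenAI client, `synthesize` could be a thin wrapper that calls `client.audio.speech.create(...)` and returns the response bytes. If the tagged takes differ from the plain takes no more than the plain takes differ from each other, the tags are probably not doing much.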


Awesome, this is actually promising!!!
I will have to play around with this!!!



This is just plain markdown that’s being rendered as italic.

“Are you still *using* this?”

“Are you still using *this*?”

“Are you *still* using this?”

You’ll notice that the TTS emphasizes the word between the asterisks (*).
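
To try this on several words at once, the asterisk-marked strings can be generated rather than typed by hand. This is a minimal sketch, assuming a hypothetical helper name `emphasis_variants`. It only prepares the input strings, and given the mixed results mentioned in the docs, there is no guarantee the voice will react to the markers:

```python
import re
from typing import Dict, Iterable


def emphasis_variants(text: str, words: Iterable[str]) -> Dict[str, str]:
    """Return one copy of `text` per word, with that word wrapped in asterisks.

    `emphasis_variants` is a hypothetical helper, not part of the OpenAI SDK.
    """
    variants: Dict[str, str] = {}
    for word in words:
        pattern = re.compile(rf"\b{re.escape(word)}\b")
        # Only the first whole-word match is marked.
        variants[word] = pattern.sub(lambda m: f"*{m.group(0)}*", text, count=1)
    return variants


variants = emphasis_variants("Are you still using this?", ["you", "still", "this"])
for word, variant in variants.items():
    print(f"{word}: {variant}")
```

Each value in `variants` is a plain Python string, so it can be passed directly as the `input` argument of the speech call.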