How to Fine-Tune Pronunciation with OpenAI's Text-to-Speech API?

I am using OpenAI’s text-to-speech API and would like to fine-tune the pronunciation (e.g., speed, intonation, accent). In particular, I am looking for ways to address issues such as mispronunciations or to specify how certain words or phrases should be read. Does anyone know if there are additional parameters or methods to achieve this?
Here’s the current request setup:

javascript

const requestJson = {
    model: 'tts-1',
    voice: languageCode,  // Must be a voice name such as 'alloy' or 'nova', not a language code
    input: text,          // Text to be spoken
    speed: speakingRate   // Speaking rate, 0.25 to 4.0 (default 1.0)
};
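const res = await fetch(requestUrl, {
    method: 'POST',
    headers: {
        // If requestUrl is the OpenAI endpoint itself rather than your own proxy,
        // an 'Authorization: Bearer <API key>' header is also required.
        'Content-Type': 'application/json'
    },
    body: JSON.stringify(requestJson)
});
if (!res.ok) {
    throw new Error(`TTS request failed with status ${res.status}`);
}
const data = await res.arrayBuffer();
const audioBlob = new Blob([data], { type: 'audio/mpeg' });  // Registered MIME type for MP3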
const audioUrl = URL.createObjectURL(audioBlob);
return audioUrl;

Any advice or insights would be greatly appreciated!

I have run into the same pronunciation issue with OpenAI's TTS (Whisper is the speech-to-text model, so it isn't involved here).
As far as I can tell, the TTS endpoint doesn't support SSML prosody tags (correct me if I'm wrong).
So, you have only two options:

  1. Use another service that supports SSML prosody tags.
  2. Use a workaround: before sending the text, replace each mispronounced word with a respelling that the voice reads correctly. You'll need to experiment to find the right substitutions.
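
For the second option, here is a minimal sketch of how the substitution step could look before the text goes into the request. The table entries and the helper names (`pronunciationOverrides`, `applyPronunciationOverrides`) are made up for illustration and are not part of the OpenAI API; the respellings are placeholders you would replace with whatever sounds right in your own tests.

```javascript
// Hypothetical respelling table: keys are words the voice misreads,
// values are spellings that sounded right when tested by ear.
const pronunciationOverrides = {
    'SQL': 'sequel',
    'nginx': 'engine x',
};

// Escape regex metacharacters so each key is matched literally.
function escapeRegExp(s) {
    return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Replace every whole-word, case-insensitive occurrence of each key.
function applyPronunciationOverrides(text, overrides) {
    let result = text;
    for (const [word, respelling] of Object.entries(overrides)) {
        const pattern = new RegExp(`\\b${escapeRegExp(word)}\\b`, 'gi');
        // A function replacement keeps '$' in a respelling from being
        // interpreted as a special replacement pattern.
        result = result.replace(pattern, () => respelling);
    }
    return result;
}

const spokenText = applyPronunciationOverrides(
    'Restart nginx, then run the SQL script.',
    pronunciationOverrides
);
// spokenText is what would go into the `input` field of the request.
```

Two caveats with this approach: the `\b` word boundary only works for keys that start and end with a letter, digit, or underscore, and replacements run one after another, so a respelling that itself contains another key will be rewritten again.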