Voice differences between Realtime API and Text-to-Speech

Hi everyone,

Does anyone know why the voices differ between the Realtime API and Text-to-Speech?

I’m currently implementing the Realtime API in my app, and I want users to be able to preview a voice before using it, with a short sample like: “Hello, my name is OpenAI.”

The challenge I’m facing is that my use case involves two scenarios:

  1. Allowing users to test the voice, where I plan to use Text-to-Speech (see the sketch below).
  2. Using the Realtime API for live interactions.
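For reference, the preview in scenario 1 is just a plain Text-to-Speech request, roughly like this sketch (this assumes the official `openai` Python package; `tts-1` and `alloy` are only the example model and voice I’m testing with):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Generate the short sample users hear before starting a live session.
with client.audio.speech.with_streaming_response.create(
    model="tts-1",   # example model
    voice="alloy",   # same voice name I select in the Realtime API
    input="Hello, my name is OpenAI.",
) as response:
    # Save the clip so the app can play it back as the preview.
    response.stream_to_file("voice_preview.mp3")
```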

However, the same voice sounds noticeably different across these two services, so the preview doesn’t match what users hear in the live session. Is this expected behavior, or am I missing something? Any insights or recommendations would be greatly appreciated!

Thanks in advance!


Hi,

I think this is expected, as Realtime audio carries emotion, emphasis, and accents, which are not available in Text-to-Speech.

Introducing the Realtime API | OpenAI

Previously, to create a similar voice assistant experience, developers had to transcribe audio with an automatic speech recognition model like Whisper, pass the text to a text model for inference or reasoning, and then play the model’s output using a text-to-speech model. This approach often resulted in loss of emotion, emphasis and accents, plus noticeable latency. With the Chat Completions API, developers can handle the entire process with a single API call, though it remains slower than human conversation. The Realtime API improves this by streaming audio inputs and outputs directly, enabling more natural conversational experiences. It can also handle interruptions automatically, much like Advanced Voice Mode in ChatGPT.
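To make that concrete: in the Realtime API the audio comes straight from the speech-to-speech model, and the voice is set on the session rather than produced by a separate TTS call. Here is a minimal sketch of what that looks like over the WebSocket interface (the `gpt-4o-realtime-preview` model name, the `alloy` voice, and the third-party `websockets` library are my assumptions here, not something from your setup):

```python
import asyncio
import json
import os

import websockets  # third-party client; any WebSocket library works


async def realtime_sample() -> None:
    # Connect directly to the Realtime API; the audio is generated by the
    # speech-to-speech model itself, not by a downstream TTS step.
    url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
    headers = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",
    }
    # Note: older websockets releases call this argument extra_headers.
    async with websockets.connect(url, additional_headers=headers) as ws:
        # Pick the voice for this session.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {"voice": "alloy"},
        }))
        # Ask the model to speak the same sample sentence.
        await ws.send(json.dumps({
            "type": "response.create",
            "response": {
                "modalities": ["audio", "text"],
                "instructions": "Say exactly: Hello, my name is OpenAI.",
            },
        }))
        # Audio arrives as streamed response.audio.delta events (base64 PCM).
        async for message in ws:
            event = json.loads(message)
            if event.get("type") == "response.audio.delta":
                pass  # feed event["delta"] to your audio player
            elif event.get("type") == "response.done":
                break


asyncio.run(realtime_sample())
```

So even if you pass the same voice name to both endpoints, the audio is produced by different models, which is why the preview and the live session won’t sound identical.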
