Voice differences between Realtime API and Text-to-Speech

Hi everyone,

Does anyone know why the voices differ between the Realtime API and Text-to-Speech?

I’m currently implementing the Realtime API in my app, and I want users to be able to preview a voice before using it, with a short sample like: “Hello, my name is OpenAI.”

The challenge I’m facing is that my use case involves two scenarios:

  1. Allowing users to test the voice, where I plan to use Text-to-Speech (see the sketch below).
  2. Using the Realtime API for live interactions.
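For reference, the preview in scenario 1 is just a plain Text-to-Speech request, roughly like this sketch (this assumes the official `openai` Python package; `tts-1` and `alloy` are only the example model and voice I’m testing with):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Generate the short sample users hear before starting a live session.
with client.audio.speech.with_streaming_response.create(
    model="tts-1",   # example model
    voice="alloy",   # same voice name I select in the Realtime API
    input="Hello, my name is OpenAI.",
) as response:
    # Save the clip so the app can play it back as the preview.
    response.stream_to_file("voice_preview.mp3")
```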

However, the same voice sounds noticeably different across these two services, so the preview doesn’t match what users hear in the live session. Is this expected behavior, or am I missing something? Any insights or recommendations would be greatly appreciated!

Thanks in advance!


Hi,

I think this is expected, as Realtime audio carries emotion, emphasis, and accents, which are not available in Text-to-Speech.

Introducing the Realtime API | OpenAI

Previously, to create a similar voice assistant experience, developers had to transcribe audio with an automatic speech recognition model like Whisper, pass the text to a text model for inference or reasoning, and then play the model’s output using a text-to-speech model. This approach often resulted in loss of emotion, emphasis and accents, plus noticeable latency. With the Chat Completions API, developers can handle the entire process with a single API call, though it remains slower than human conversation. The Realtime API improves this by streaming audio inputs and outputs directly, enabling more natural conversational experiences. It can also handle interruptions automatically, much like Advanced Voice Mode in ChatGPT.
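To make that concrete: in the Realtime API the audio comes straight from the speech-to-speech model, and the voice is set on the session rather than produced by a separate TTS call. Here is a minimal sketch of what that looks like over the WebSocket interface (the `gpt-4o-realtime-preview` model name, the `alloy` voice, and the third-party `websockets` library are my assumptions here, not something from your setup):

```python
import asyncio
import json
import os

import websockets  # third-party client; any WebSocket library works


async def realtime_sample() -> None:
    # Connect directly to the Realtime API; the audio is generated by the
    # speech-to-speech model itself, not by a downstream TTS step.
    url = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
    headers = {
        "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
        "OpenAI-Beta": "realtime=v1",
    }
    # Note: older websockets releases call this argument extra_headers.
    async with websockets.connect(url, additional_headers=headers) as ws:
        # Pick the voice for this session.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {"voice": "alloy"},
        }))
        # Ask the model to speak the same sample sentence.
        await ws.send(json.dumps({
            "type": "response.create",
            "response": {
                "modalities": ["audio", "text"],
                "instructions": "Say exactly: Hello, my name is OpenAI.",
            },
        }))
        # Audio arrives as streamed response.audio.delta events (base64 PCM).
        async for message in ws:
            event = json.loads(message)
            if event.get("type") == "response.audio.delta":
                pass  # feed event["delta"] to your audio player
            elif event.get("type") == "response.done":
                break


asyncio.run(realtime_sample())
```

So even if you pass the same voice name to both endpoints, the audio is produced by different models, which is why the preview and the live session won’t sound identical.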
