Is the Realtime API directly speech-to-speech?

Just wondering, is the Realtime API directly speech-to-speech (STS)? I have been getting different answers about how it works under the hood. Some say it is a pipeline of STT, LLM, and TTS, while others say it is directly speech to speech.

gpt-4o is a multimodally trained AI. This includes audio input and output.

This is exposed in special model names, gpt-4o-realtime-preview-2024-12-17 for example, or gpt-4o-audio-preview-2024-12-17 on Chat Completions.

The AI actually accepts audio, encoded into input tokens that represent audio patterns, and generates corresponding audio as output for its response. It understands and produces in "voice".

I see, so there is no step of converting "voice" to "text" anywhere in the process with the Realtime API?

The only "text conversion" is providing you a transcript of the output. This uses a separate transcription service for audio-to-text.

There is conversion: WAV audio to a tokenized spectral audio representation for understanding (but not text), and the reverse codec for output. This is proprietary.
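
To connect this to what you actually configure: here is a small sketch of the kind of `session.update` payload where that separate transcription service gets switched on, while the audio itself stays in the native audio-token path. The event and field names follow the 2024 preview docs and may have changed since, so treat them as assumptions.

```python
import json

# Sketch only: the model handles audio natively; a transcript of the caller's
# audio is an optional extra you request via the session configuration.
session_update = {
    "type": "session.update",
    "session": {
        "modalities": ["audio", "text"],   # native audio in/out, plus text
        "voice": "alloy",                  # one of the preset voices
        "input_audio_transcription": {     # optional Whisper transcript of input audio
            "model": "whisper-1"
        },
    },
}

payload = json.dumps(session_update)  # send this over the WebSocket / WebRTC data channel
```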


Thank you so much, much appreciated! The other thing I'm super curious about is why the Realtime API seems to be less intelligent than ChatGPT voice mode. For example, if the AI asks me something and my response is inaudible or indistinguishable, ChatGPT voice mode will ask the question again to get the answer, while the Realtime API will just say "got it" and keep moving on.

Here's a thought experiment that can get you thinking about the quality: the AI is basically speaking another language when it produces audio.

How much better might the AI be able to answer a question about an Indian pop star when it is speaking the Tamil language instead of English, when the bulk of the knowledge that was pretrained is in Tamil - or Hindi?

This inference skill is something emergent in large language models: it can understand the similarity between "rabbit", "bunny", "hare", or "兎" vs. "うさぎ".

It can then apply that inferred knowledge to complete thoughts being produced about a topic, across the many written world languages it has picked up on, demonstrating that the AI has actually learned beyond the input or output language itself.

Then we must consider that a great part of the skill is post-training (like fine-tuning) on conversations: massive collected and graded conversations, learning what "summarize this" means, and then supervised training ("I can't sing for you").

A separate set of constructed training on audio chats must be used, and it will have a different tone in its length and language. Plus, there are simply unseen instructions and unseen audio placed in the context to start things off.


This response is amazing and took some time for me to digest. Can I interpret this to mean that you are suggesting ChatGPT's voice mode has undergone more training for voice input/output than the Realtime API, and therefore provides better quality?


Does normal ChatGPT use a different model than the API? Yes (although the API has a chatgpt-4o-latest that is supposed to be the equivalent).

Does ChatGPT have different voices? Yes. The voices are inspired by an initial context, but they certainly also have training behind them, which may or may not carry across models.

Does ChatGPT "advanced voice mode" use a different model than any available on the API? Who knows… that's OpenAI's secret. It certainly has different "ChatGPT" instructions than what you start with on the API, and it is framed in a different application.

So basically, you get to see for yourself, independently, whether what is on the API will work for you, your application idea - and your budget.


This is very clear, thanks for the clarification. Super helpful, thank you!


Does the OpenAI Realtime API natively support text-to-speech (TTS) and speech-to-text (STT) functionality, or do we need to wire up tools like Whisper and the TTS voice models manually over WebSockets?

I don't know; I tried using the Azure OpenAI Realtime APIs, but Azure's docs are not very helpful, or I didn't go to the right place.

What I am trying to figure out for myself is: "Am I going to implement the TTS-ChatGPT Realtime-STT ecosystem myself using WebSockets, or do I just use the ChatGPT Realtime API?"

Thanks for the clarification; this has been giving me a headache for weeks already.
Mukhsin

Your choice:

  1. If your goal is to provide a better voice quality experience, and you think you can do it, then do both (a rough sketch of the do-it-yourself chained pipeline follows this list).
  2. If your goal is just to provide a better voice feature experience, within the bounds of OpenAI's current quality, then just do the latter.
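
To make option 1 concrete, here is a rough, illustrative sketch (not an official pattern) of the classic chained pipeline using the standard `openai` Python SDK. The model names (`whisper-1`, `gpt-4o`, `tts-1`) and the way the audio bytes are saved are assumptions that may vary by SDK version and account access.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def reply_to_audio(input_path: str, output_path: str) -> None:
    """Chained STT -> LLM -> TTS pipeline (the do-it-yourself alternative to Realtime)."""
    # 1. Speech-to-text with Whisper
    with open(input_path, "rb") as f:
        transcript = client.audio.transcriptions.create(model="whisper-1", file=f)

    # 2. Text reply from a chat model
    chat = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": transcript.text}],
    )
    reply_text = chat.choices[0].message.content

    # 3. Text-to-speech; .content holds the raw audio bytes in recent SDK versions
    speech = client.audio.speech.create(model="tts-1", voice="alloy", input=reply_text)
    with open(output_path, "wb") as out:
        out.write(speech.content)
```

The trade-off is latency and the loss of paralinguistic information (tone, hesitation, interruptions) at each hop, which is exactly what the native speech-to-speech path of the Realtime API avoids.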

Thank you @swooby! It would be awesome if I could build the TTS-ChatGPT Realtime-STT ecosystem myself, which I will try. But I think that, in order to build this, OpenAI's TTS & Whisper also need to support WebSockets on their end, right? Like for server-to-server WebSocket connections. Or am I missing something here?

Seems correct.
They even mention this in their docs:
https://platform.openai.com/docs/guides/realtime#get-started-with-the-realtime-api

Get started with the Realtime API

Learn to connect to the Realtime API using either WebRTC (ideal for client-side applications) or WebSockets (great for server-to-server applications).
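
For reference, a minimal server-to-server sketch in Python following that guide: the endpoint URL, the preview model name, and the `OpenAI-Beta: realtime=v1` header are taken from the 2024 preview docs and may have changed since, so treat them as assumptions and check the page linked above.

```python
# Minimal sketch: server-to-server WebSocket connection to the Realtime API.
import asyncio
import json
import os

import websockets  # pip install websockets

URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2024-12-17"
HEADERS = {
    "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
    "OpenAI-Beta": "realtime=v1",  # beta header per the 2024 preview docs
}

async def main() -> None:
    # Older websockets releases use extra_headers=; newer ones use additional_headers=.
    async with websockets.connect(URL, extra_headers=HEADERS) as ws:
        # Audio in/out is handled natively by the model, so no separate
        # Whisper/TTS hop is needed on this connection.
        await ws.send(json.dumps({
            "type": "response.create",
            "response": {"modalities": ["audio", "text"]},
        }))
        async for message in ws:
            event = json.loads(message)
            print(event.get("type"))  # e.g. response.audio.delta, response.done
            if event.get("type") == "response.done":
                break

asyncio.run(main())
```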