Is the Realtime API directly speech-to-speech?

Just wondering, is the Realtime API directly speech-to-speech (STS)? I have been getting different answers about how it works under the hood. Some say it is a pipeline of STT, LLM, and TTS, while others say it is directly speech to speech.

gpt-4o is a multimodally trained AI. This includes audio input and output.

This is exposed in special model names, gpt-4o-realtime-preview-2024-12-17 for example, or gpt-4o-audio-preview-2024-12-17 on Chat Completions.

The AI actually accepts audio, encoded into input tokens that represent audio patterns, and generates corresponding audio as output for its response. It understands and produces in "voice".

I see, so there is no step of converting "voice" to "text" anywhere in the process with the Realtime API?

The only "text conversion" is providing you a transcript of the output. This uses a separate transcription service for audio-to-text.

There is conversion: WAV audio to a tokenized spectral audio representation for understanding (but not text), and the reverse codec for output. This is proprietary.
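
To connect this to what you actually configure: here is a small sketch of the kind of `session.update` payload where that separate transcription service gets switched on, while the audio itself stays in the native audio-token path. The event and field names follow the 2024 preview docs and may have changed since, so treat them as assumptions.

```python
import json

# Sketch only: the model handles audio natively; a transcript of the caller's
# audio is an optional extra you request via the session configuration.
session_update = {
    "type": "session.update",
    "session": {
        "modalities": ["audio", "text"],   # native audio in/out, plus text
        "voice": "alloy",                  # one of the preset voices
        "input_audio_transcription": {     # optional Whisper transcript of input audio
            "model": "whisper-1"
        },
    },
}

payload = json.dumps(session_update)  # send this over the WebSocket / WebRTC data channel
```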


Thank you so much, much appreciated! The other thing I'm super curious about is why the Realtime API seems to be less intelligent than ChatGPT voice mode. For example, if the AI asks me something and my response is inaudible or indistinguishable, ChatGPT voice mode will ask the question again to get the answer, while the Realtime API will just say "got it" and keep moving on.

Here's a thought experiment that can get you thinking about the quality: the AI is basically speaking another language when it produces audio.

How much better might the AI be able to answer a question about an Indian pop star when it is speaking the Tamil language instead of English, when the bulk of the knowledge that was pretrained is in Tamil - or Hindi?

This inference skill is something emergent in large language models: it can understand the similarity between "rabbit", "bunny", "hare", or "兎" vs. "うさぎ".

It can then apply that inferred knowledge to complete thoughts being produced about a topic, across the many written world languages it has picked up on, demonstrating that the AI has actually learned beyond the input or output language itself.

Then we must consider that a great part of the skill is post-training (like fine-tuning) on conversations: massive collected and graded conversations, learning what "summarize this" means, and then supervised training ("I can't sing for you").

A separate set of constructed training on audio chats must be used, and it will have a different tone in its length and language. Plus, there are simply unseen instructions and unseen audio placed in the context to start things off.


This response is amazing and took some time for me to digest. Can I interpret this to mean that you are suggesting ChatGPT's voice mode has undergone more training for voice input/output than the Realtime API, and therefore provides better quality?


Does normal ChatGPT use a different model than the API? Yes (although the API has a chatgpt-4o-latest that is supposed to be the equivalent).

Does ChatGPT have different voices? Yes. The voices are inspired by an initial context, but they certainly also have training behind them, which may or may not carry across models.

Does ChatGPT "advanced voice mode" use a different model than any available on the API? Who knows… that's OpenAI's secret. It certainly has different "ChatGPT" instructions than what you start with on the API, and it is framed in a different application.

So basically, you get to see for yourself, independently, whether what is on the API will work for you, your application idea - and your budget.


This is very clear, thanks for the clarification. Super helpful, thank you!


Does the OpenAI Realtime API natively support text-to-speech (TTS) and speech-to-text (STT) functionality, or do we need to wire up tools like Whisper and the TTS voice models manually over WebSockets?

I don't know; I tried using the Azure OpenAI Realtime APIs, but Azure's docs are not very helpful, or I didn't go to the right place.

What I am trying to figure out for myself is: "Am I going to implement the TTS-ChatGPT Realtime-STT ecosystem myself using WebSockets, or do I just use the ChatGPT Realtime API?"

Thanks for the clarification; this has been giving me a headache for weeks already.
Mukhsin

Your choice:

  1. If your goal is to provide a better voice quality experience, and you think you can do it, then do both (a rough sketch of the do-it-yourself chained pipeline follows this list).
  2. If your goal is just to provide a better voice feature experience, within the bounds of OpenAI's current quality, then just do the latter.
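
To make option 1 concrete, here is a rough, illustrative sketch (not an official pattern) of the classic chained pipeline using the standard `openai` Python SDK. The model names (`whisper-1`, `gpt-4o`, `tts-1`) and the way the audio bytes are saved are assumptions that may vary by SDK version and account access.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def reply_to_audio(input_path: str, output_path: str) -> None:
    """Chained STT -> LLM -> TTS pipeline (the do-it-yourself alternative to Realtime)."""
    # 1. Speech-to-text with Whisper
    with open(input_path, "rb") as f:
        transcript = client.audio.transcriptions.create(model="whisper-1", file=f)

    # 2. Text reply from a chat model
    chat = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": transcript.text}],
    )
    reply_text = chat.choices[0].message.content

    # 3. Text-to-speech; .content holds the raw audio bytes in recent SDK versions
    speech = client.audio.speech.create(model="tts-1", voice="alloy", input=reply_text)
    with open(output_path, "wb") as out:
        out.write(speech.content)
```

The trade-off is latency and the loss of paralinguistic information (tone, hesitation, interruptions) at each hop, which is exactly what the native speech-to-speech path of the Realtime API avoids.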

Thank you @swooby! It would be awesome if I could build the TTS-ChatGPT Realtime-STT ecosystem myself, which I will try. But I think that, in order to build this, OpenAI's TTS & Whisper also need to support WebSockets on their end, right? Like for server-to-server WebSocket connections. Or am I missing something here?

Seems correct.
They even mention this in their docs:
https://platform.openai.com/docs/guides/realtime#get-started-with-the-realtime-api

Get started with the Realtime API

Learn to connect to the Realtime API using either WebRTC (ideal for client-side applications) or WebSockets (great for server-to-server applications).
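
For reference, a minimal server-to-server sketch in Python following that guide: the endpoint URL, the preview model name, and the `OpenAI-Beta: realtime=v1` header are taken from the 2024 preview docs and may have changed since, so treat them as assumptions and check the page linked above.

```python
# Minimal sketch: server-to-server WebSocket connection to the Realtime API.
import asyncio
import json
import os

import websockets  # pip install websockets

URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2024-12-17"
HEADERS = {
    "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
    "OpenAI-Beta": "realtime=v1",  # beta header per the 2024 preview docs
}

async def main() -> None:
    # Older websockets releases use extra_headers=; newer ones use additional_headers=.
    async with websockets.connect(URL, extra_headers=HEADERS) as ws:
        # Audio in/out is handled natively by the model, so no separate
        # Whisper/TTS hop is needed on this connection.
        await ws.send(json.dumps({
            "type": "response.create",
            "response": {"modalities": ["audio", "text"]},
        }))
        async for message in ws:
            event = json.loads(message)
            print(event.get("type"))  # e.g. response.audio.delta, response.done
            if event.get("type") == "response.done":
                break

asyncio.run(main())
```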