Just wondering: is the Realtime API directly speech-to-speech (STS)? I have been getting different answers about how it works. Some say it is STT, LLM, TTS, while others say it is directly speech-to-speech.
gpt-4o is a multimodally trained AI; that includes audio input and output. This capability is exposed under special model names: gpt-4o-realtime-preview-2024-12-17, for example, or gpt-4o-audio-preview-2024-12-17 on Chat Completions.
The AI actually accepts audio, encoded into input tokens that represent audio patterns, and generates corresponding audio as output for its response. It understands and produces "voice".
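As an illustration, here is a minimal sketch of asking the Chat Completions audio model for a spoken reply. It assumes the openai Python SDK and the documented modalities/audio parameters; the voice name, audio format, and prompt are placeholder choices, not recommendations:

```python
# Sketch: request audio output from gpt-4o-audio-preview via Chat Completions.
# Assumes the `openai` Python SDK and OPENAI_API_KEY in the environment.
import base64

from openai import OpenAI

client = OpenAI()

completion = client.chat.completions.create(
    model="gpt-4o-audio-preview-2024-12-17",
    modalities=["text", "audio"],               # ask for a transcript and audio out
    audio={"voice": "alloy", "format": "wav"},  # voice/format are placeholder choices
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
)

# The spoken reply comes back base64-encoded inside the message.
wav_bytes = base64.b64decode(completion.choices[0].message.audio.data)
with open("hello.wav", "wb") as f:
    f.write(wav_bytes)
```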
I see, so there is no step of converting "voice" to "text" anywhere in the Realtime API process?
The only "text conversion" is the transcript of the output that is provided to you, which uses a separate audio-to-text transcription service.
There is a conversion, though: WAV audio into a tokenized spectral-audio representation for understanding (but not text), and the reverse codec for output. That codec is proprietary.
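To illustrate the transcript point: on the Realtime API, that optional input transcription is switched on per session, alongside the native audio exchange. Below is a sketch of the session.update payload you would send over the WebSocket; field names follow the published docs at the time of writing, so treat them as assumptions to verify:

```python
# Sketch: a Realtime `session.update` event that enables input-audio transcription.
# This dict would be JSON-serialized and sent over an already-open Realtime WebSocket.
import json

session_update = {
    "type": "session.update",
    "session": {
        "modalities": ["audio", "text"],
        "voice": "alloy",  # placeholder voice
        # Transcription is a separate add-on service; the model itself
        # still "hears" tokenized audio, not text.
        "input_audio_transcription": {"model": "whisper-1"},
    },
}

payload = json.dumps(session_update)  # send this string over the WebSocket connection
```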
Thank you so much, I appreciate it! The other thing I'm super curious about is why the Realtime API seems less intelligent than ChatGPT voice mode. For example, if the AI asks me something and my response is inaudible or indistinguishable, ChatGPT voice mode will ask the question again to get the answer, while the Realtime API will just say "got it" and keep moving on.
Here's a thought experiment that can get you thinking about the quality: the AI is basically speaking another language when it produces audio.
How much better might the AI be able to answer questions about an Indian pop star when it is speaking Tamil instead of English, when the bulk of the pretrained knowledge is in Tamil - or Hindi?
This inference skill is something emergent in large language models: the ability to understand the similarity between "rabbit", "bunny", "hare", or "兎" vs. "うさぎ".
It can then apply that inferred knowledge to complete thoughts being produced about a topic, across the many written world languages it has picked up on, demonstrating that the AI has actually learned something beyond the input or output language itself.
Then we must consider that a great part of the skill is post-training (like fine-tuning) on conversations: massive collected and graded conversations, teaching what "summarize this" means, and then supervised training ("I can't sing for you").
A separate set of constructed training on audio chats must be used, and those will have a different tone in length and language. Plus, there are simply unseen instructions and unseen audio placed in context to start things off.
Thanks for this amazing reply! It took some time for me to digest. Just wondering, can I interpret this to mean that ChatGPT's voice mode has undergone more training for voice input/output than the Realtime API, and therefore provides better quality?
Does normal ChatGPT use a different model than the API? Yes (although the API has chatgpt-4o-latest, which is supposed to be the equivalent).
Does ChatGPT have different voices? Yes. The voices are inspired by an initial context, but they certainly have training as well, which may or may not carry across models.
Does ChatGPT "advanced voice mode" use a different model than any available on the API? Who knows... that is OpenAI's secret. It certainly has different "ChatGPT" instructions than what you start with on the API, and it is framed in a different application.
So basically, you get to see for yourself whether what is on the API will work for you, your application idea, and your budget, independently.
This is very clear, thanks for the clarification. Super helpful, thank you!
Does the OpenAI Realtime API natively support text-to-speech (TTS) and speech-to-text (STT) functionality, or do we need to configure tools like Whisper and TTS voice models manually using WebSockets?
I don't know; I tried using the Azure OpenAI Realtime API, but Azure's docs are not very helpful, or I didn't go to the right place.
What I am trying to figure out for myself is: do I implement a real-time STT-ChatGPT-TTS ecosystem myself using WebSockets, or do I just use the Realtime API?
Thanks for the clarification. I have been having this headache for weeks already.
Mukhsin
Your choice:
- If your goal is to provide a better voice-quality experience, and you think you can do it, then do both (a rough sketch of the do-it-yourself pipeline is below).
- If your goal is just to provide a better voice feature experience, within the bounds of OpenAI's current quality, then just do the latter (use the Realtime API).
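For reference, here is a rough, hypothetical sketch of the do-it-yourself pipeline (STT, then a chat model, then TTS) using the plain REST endpoints through the openai Python SDK. The file names, prompt, voice, and model choices are placeholders, not recommendations; the Realtime API collapses these three hops into a single audio-native session.

```python
# Sketch of a do-it-yourself voice turn: Whisper STT -> chat model -> TTS.
# Assumes the `openai` Python SDK and OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()

# 1) Speech to text with Whisper.
with open("user_question.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(model="whisper-1", file=audio_file)

# 2) Text reply from a chat model.
reply = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": transcript.text}],
)
answer_text = reply.choices[0].message.content

# 3) Text to speech for playback.
speech = client.audio.speech.create(model="tts-1", voice="alloy", input=answer_text)
with open("answer.mp3", "wb") as f:
    f.write(speech.content)  # raw audio bytes of the synthesized reply
```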
Thank you @swooby! It would be awesome if I could build the real-time STT-ChatGPT-TTS ecosystem myself, which I will try. But I think that, to build this, OpenAI's TTS and Whisper also need to support WebSockets on their end, right? Like for server-to-server WebSocket connections. Or am I missing something here?
Seems correct.
They even mention this in their docs:
https://platform.openai.com/docs/guides/realtime#get-started-with-the-realtime-api
Get started with the Realtime API
Learn to connect to the Realtime API using either WebRTC (ideal for client-side applications) or WebSockets (great for server-to-server applications).
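For the server-to-server route, here is a minimal connection sketch. It assumes the Python websockets package; the URL, headers, and event names follow the current docs and should be treated as assumptions to verify against your API version:

```python
# Sketch: connect to the Realtime API over a server-to-server WebSocket
# and request one spoken response, printing the event types that come back.
import asyncio
import json
import os

import websockets  # pip install websockets

URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview-2024-12-17"
HEADERS = {
    "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
    "OpenAI-Beta": "realtime=v1",
}

async def main() -> None:
    # Note: older `websockets` releases use `extra_headers=` instead of `additional_headers=`.
    async with websockets.connect(URL, additional_headers=HEADERS) as ws:
        # Ask the model for a single response; audio and transcript arrive as events.
        await ws.send(json.dumps({
            "type": "response.create",
            "response": {
                "modalities": ["audio", "text"],
                "instructions": "Greet the user in one short sentence.",
            },
        }))
        async for message in ws:
            event = json.loads(message)
            print(event.get("type"))  # e.g. response.audio.delta, response.audio_transcript.delta
            if event.get("type") in ("response.done", "error"):
                break

asyncio.run(main())
```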