This inference is not based on ChatGPT's Advanced Voice Mode itself, but on the results of testing the real-time API in the Playground.
Whisper appears to be used only to transcribe the user's speech and the model's speech into text; the model itself actually receives the user's audio and responds with audio.
The reason I think this is that on several occasions the transcription failed, producing text unrelated to what was actually said, and yet the model still responded correctly to what had been spoken.
Additionally, even when Whisper produced one of its frequent hallucinations, such as "Thank you for watching," the model's response remained relevant to what was actually said.
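To make that separation concrete, here is a minimal sketch of listening to a Realtime API session over WebSocket, assuming the beta event names (`session.update` with `input_audio_transcription`, `conversation.item.input_audio_transcription.completed`, `response.audio.delta`); treat the exact fields and model name as my assumptions rather than a definitive reference. Streaming the microphone audio in via `input_audio_buffer.append` is omitted; the point is only that the Whisper transcript and the model's audio reply arrive as independent event streams.

```python
# Minimal listener sketch for a Realtime API session (not a full client).
# Event names reflect the beta API as I understand it; treat them as assumptions.
import asyncio
import json
import os

import websockets  # note: newer versions use additional_headers instead of extra_headers

URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
HEADERS = {
    "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
    "OpenAI-Beta": "realtime=v1",
}


async def main() -> None:
    async with websockets.connect(URL, extra_headers=HEADERS) as ws:
        # Ask the server to also run Whisper on the input audio. This is optional:
        # the model answers from the audio either way, and the transcript arrives
        # as a separate event stream.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {"input_audio_transcription": {"model": "whisper-1"}},
        }))

        # (Sending audio via input_audio_buffer.append is omitted in this sketch.)
        async for message in ws:
            event = json.loads(message)
            etype = event.get("type")

            if etype == "conversation.item.input_audio_transcription.completed":
                # Whisper's transcript of what the user said. Even if this is
                # wrong or hallucinated, it does not feed the model's answer.
                print("whisper transcript:", event.get("transcript"))
            elif etype == "response.audio.delta":
                # The model's actual reply: audio generated from the user's audio,
                # streamed as base64-encoded chunks.
                print("audio chunk:", len(event.get("delta", "")), "base64 chars")
            elif etype == "response.audio_transcript.done":
                # A text rendering of the model's spoken reply, again separate
                # from the audio itself.
                print("assistant transcript:", event.get("transcript"))


if __name__ == "__main__":
    asyncio.run(main())
```

In my tests it was exactly this decoupling that showed up: the `...input_audio_transcription.completed` transcript could be garbage while the audio reply was still on point.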
While I am not sure whether ChatGPT's AVM and the real-time API are strictly the same, it seems reasonable to treat them as nearly identical.