This inference is not based on ChatGPT's Advanced Voice Mode itself, but on the results of testing the real-time API in the Playground.
Whisper appears to be used only to transcribe the user's speech and the model's speech into text; the model itself actually receives the user's audio and responds with audio.
The reason I think this is that on several occasions the transcription failed, producing text unrelated to what was actually said, and yet the model still responded correctly to what had been spoken.
Additionally, even when Whisper produced one of its frequent hallucinations, such as "Thank you for watching," the model's response remained relevant to what was actually said.
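To make that separation concrete, here is a minimal sketch of listening to a Realtime API session over WebSocket, assuming the beta event names (`session.update` with `input_audio_transcription`, `conversation.item.input_audio_transcription.completed`, `response.audio.delta`); treat the exact fields and model name as my assumptions rather than a definitive reference. Streaming the microphone audio in via `input_audio_buffer.append` is omitted; the point is only that the Whisper transcript and the model's audio reply arrive as independent event streams.

```python
# Minimal listener sketch for a Realtime API session (not a full client).
# Event names reflect the beta API as I understand it; treat them as assumptions.
import asyncio
import json
import os

import websockets  # note: newer versions use additional_headers instead of extra_headers

URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
HEADERS = {
    "Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}",
    "OpenAI-Beta": "realtime=v1",
}


async def main() -> None:
    async with websockets.connect(URL, extra_headers=HEADERS) as ws:
        # Ask the server to also run Whisper on the input audio. This is optional:
        # the model answers from the audio either way, and the transcript arrives
        # as a separate event stream.
        await ws.send(json.dumps({
            "type": "session.update",
            "session": {"input_audio_transcription": {"model": "whisper-1"}},
        }))

        # (Sending audio via input_audio_buffer.append is omitted in this sketch.)
        async for message in ws:
            event = json.loads(message)
            etype = event.get("type")

            if etype == "conversation.item.input_audio_transcription.completed":
                # Whisper's transcript of what the user said. Even if this is
                # wrong or hallucinated, it does not feed the model's answer.
                print("whisper transcript:", event.get("transcript"))
            elif etype == "response.audio.delta":
                # The model's actual reply: audio generated from the user's audio,
                # streamed as base64-encoded chunks.
                print("audio chunk:", len(event.get("delta", "")), "base64 chars")
            elif etype == "response.audio_transcript.done":
                # A text rendering of the model's spoken reply, again separate
                # from the audio itself.
                print("assistant transcript:", event.get("transcript"))


if __name__ == "__main__":
    asyncio.run(main())
```

In my tests it was exactly this decoupling that showed up: the `...input_audio_transcription.completed` transcript could be garbage while the audio reply was still on point.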
While I am not sure whether ChatGPT's AVM and the real-time API are strictly the same, it seems reasonable to treat them as nearly identical.