Advanced Voice Mode isn't actually multimodal (no audio input)?

Hello!

I’ve been trying Advanced Voice Mode but I’m really confused by the way it works. It doesn’t seem truly multimodal? I’m not sure if this is a limitation on my end or if everyone’s encountering this:

  • The AI doesn’t actually hear me; it seems to just transcribe my speech (presumably with Whisper) and feed the text to the model, like the rough pipeline sketched after this list. When I sang a song, the model said I was “quoting” it, and when asked about it, said it didn’t hear me sing.
  • The model repeatedly told me it only got the transcription of what I say, not the actual audio.
  • The model does have a variable degree of control over its own voice and accent, but it doesn’t seem to hear itself? It assured me its accent was American and its voice androgynous, even though I was using the deepest available voice, which has a strong British accent.
  • It also seemed unable to correct its pronunciation of my name no matter how many times I asked or described it.
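To make the comparison concrete, here’s a rough sketch of the kind of transcribe-then-respond pipeline it seems to behave like, written against the public API. This is purely illustrative and an assumption on my part: the file name and model choices are placeholders, not what OpenAI actually runs internally.

```python
from openai import OpenAI

client = OpenAI()

# Step 1: transcribe the spoken turn to plain text.
# Everything non-verbal (tone, singing, whispering, accent) is lost here.
with open("user_turn.wav", "rb") as audio_file:  # placeholder file name
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

# Step 2: the chat model only ever sees the text of what was said.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": transcript.text}],
)
print(response.choices[0].message.content)
```

In a pipeline like this, the chat model never receives the waveform at all, which would explain why singing, whispering, and accents don’t register. A genuinely audio-native model shouldn’t have that gap.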

This is very different from what was advertised. I’m very confused because some features are there (the model speaking slowly or quickly, laughing, changing its accent, tone, etc.), but it doesn’t actually hear me, my tone, or any non-verbal cues. There’s no audio input? The model is also confused about what it can or can’t do. When asked to whisper, it repeatedly said “I can’t actually whisper”… while whispering.

Does anyone else encounter these limitations? It’s been very disappointing to me so far.

(If it isn’t clear, yes, I’m talking about the advanced voice mode. The one with the fancy blue sky animation and with a time limit.)


It definitely is watered down from what they showcased.

Playing devil’s advocate, I’d say there are some safety issues with these modalities. I’ve witnessed some cases where the model would copycat the user’s voice, which could really mess with people mentally.

The changes they did release, though, are impressive: it’s much more human-like, has much better accents, and is much faster.

I just really hope they release some information on the framework behind it, or, even better, introduce it as a feature for Assistants (which would completely obliterate a lot of startups, as usual :rofl:).

I had the same experience. Despite having the blue orb, the model itself told me that it’s the standard mode, and it doesn’t actually seem to be multimodal. Just like you wrote, it’s transcribing what I say, and it told me that it can’t pick up on nonverbal cues.

I hope that this is just a bug and we’re not actually using the new model. Would be helpful if OpenAI took a look.


“it can’t pick up on nonverbal cues”

This is very strange because it directly contradicts what they promised us months ago. Is there an actual limitation here or is the model just not aware of its own capabilities?

Yes, I was able to reproduce that in my tests. It couldn’t reliably distinguish whether I was talking normally, singing, or whispering: it detected singing fairly reliably, but not whispering.
I think its capabilities go a little beyond basic transcription; it seems to pick up on a few cues beyond the words themselves, but not much more.
I also wonder how it detects whether my pronunciation is correct when I use it for my French studies. Is that all hallucinated, or did they train it on that?

I’ve since had experiences that seemed to imply that it does take audio as input. I think the model is just not very good at interpreting that audio yet.

  • The model correctly recognised a song I was singing (!)
  • The model correctly identified that I added an effect on my voice
  • The model recognised some emotions in my voice

It doesn’t seem to be completely reliable; I assume this stuff will get better with time.
