Advanced Voice Mode Limited

We would expect GPT-4 to be 50/50 precisely because it cannot hear. My point is that until this afternoon GPT-4o behaved the same as GPT-4 in that it didn’t seem to hear. I played with GPT-4o AVM all day yesterday and today. Yesterday it didn’t seem to hear at all, and this morning I had no indication it could hear. Then this afternoon I started getting a couple of chats where it was clear that it could hear, and now (early evening) most of the chats I start with GPT-4o can hear. So I could be wrong, but it seems like it just took them a bit to get the new instances of GPT-4o up. Anyway, it seems like a non-issue now. Thanks for the help, and let me know if you find out anything at Dev Day.

2 Likes

Which models are these?

That would indeed be very crafty. They could have trained the audio model on audio output from the text model?

2 Likes

Whisper has been out for a while now from OpenAI. Here are more details:

https://openai.com/index/whisper/

For a while it was only available as open source, but now they have it in the API as well.
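
For anyone who wants to try it, a minimal sketch of calling hosted Whisper through the official Python SDK looks roughly like this (the file name is just a placeholder):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Transcribe a local audio file with the hosted Whisper model.
# "meeting.mp3" is a placeholder file name.
with open("meeting.mp3", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

print(transcript.text)
```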

1 Like

Do you feel that advanced mode reacts differently than standard mode, as if it were working under different guidelines? I feel like I’m talking to a different AI when I use advanced mode versus standard mode.

2 Likes

It might be different. It is way more conversational, which means the responses are shorter, much like a normal conversation.

You don’t get paragraphs and bullet points or feel lectured to; it feels like a conversation.

2 Likes

Not at all.

My theory (shared among others) is that they are using a more advanced form of Whisper that actually uses the International Phonetic Alphabet to understand the accents/pronunciation behind the words.

This works both ways, obviously. ElevenLabs has had this feature for months now.

One way to test this is by humming, making noises, or adding pauses (without being interrupted). Or try to imitate crying while saying something else and ask what kind of emotion was used. For me, and for others, it fails to recognize the crying and instead bases its answer on the context of the dialogue.

It’s very cool, though. I love to mess around with it and bounce between languages. Very excited for the near future of communicating immediately about what I am working on - and it being able to see it :raised_hands:

3 Likes

I meant the direct speech-to-speech models you were talking about.

I know Whisper; the speech input in the public app is basically a Whisper frontend for the text bot. But you were saying there are speech-to-speech models being rolled out. Which models behave like that? The o1 ones? I think of those as variants of GPT-4 for some reason.
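
What I picture by “Whisper frontend” is roughly speech → transcript → text model → text-to-speech. A rough sketch with the public API (the model and voice names here are just my assumptions for illustration, not confirmed internals of the app):

```python
from openai import OpenAI

client = OpenAI()

# 1. Speech -> text with Whisper.
with open("question.wav", "rb") as f:  # placeholder recording
    user_text = client.audio.transcriptions.create(model="whisper-1", file=f).text

# 2. Text -> text with the chat model.
reply = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": user_text}],
).choices[0].message.content

# 3. Text -> speech with a TTS voice, saved to a file.
speech = client.audio.speech.create(model="tts-1", voice="alloy", input=reply)
speech.write_to_file("answer.mp3")
```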

1 Like

It appears that some capabilities, like real voice recognition/analysis, accents, and the ability to impersonate others, can be turned on or off.

After several tests, I’m convinced that the advanced voice model cannot actually hear my voice, but I’ve seen demos from others that suggest it can. I say “suggest” because the model can lie at times and make educated guesses based on what it knows about you.

I’m not sure why some people are getting the fuller-featured version, but it could be region-based.

3 Likes

A post was split to a new topic: Voice mode not working for custom GPTs

I tested this today and it worked. The model can detect changes in pitch, tone, pace, and potentially more. It was able to estimate my age fairly closely and correctly identified my accent. So audio is making it into the system.

On the first day I experimented with the emotional range, the model could convey a broad spectrum of emotions. It could also cough, sneeze, yawn, burp, laugh, etc. Since the general launch, many of those abilities seem to have been restricted by OpenAI. This is frustrating, as the model is obviously capable of so much more. These abilities were likely restricted due to misuse by a small minority of users, but that impedes the development of the AI, because it prevents users and the model from interacting in ways that would assist in its enhancement. If Sam Altman wants to hit that 1,000-day goal, the company needs to loosen the restrictions.

3 Likes

How does no one realize it has zero internet access? Advanced Voice Mode is USELESS.

3 Likes

Most new things roll out without internet access, and it gets added later.

For example, 4o-mini just got internet access a few days ago.

4 Likes

So in most of my conversations with AVM, it still tells me that it can’t actually hear anything (even though it probably can), and because of that it can’t complete most of my requests that would require audio input. This is not something we saw in any of the demos (because why would they post a demo that failed?).

Is this something the developers are aware of and trying to correct? It seems to be a pretty common issue. For those going to Dev Day, please let us know if you find out any info on the topic.

2 Likes

It irritates me that I can’t use voice with my custom GPT at all. They removed the default voice, so now user-built GPTs hear the text but do not read the response back. I don’t care about advanced voice, but hands-free on my phone was nice…

2 Likes

Yes. This is a bit strange.

Typically with GPT models the idea is: “Don’t ask it about itself, and don’t assume anything it says about itself or its abilities is true.”

1 Like

It would be helpful if you actually explained a bit more. What led you to this conclusion? Any suggested prompts? Etc.

3 Likes

That’s great… except that during the May 2024 demos they specifically stated that audio and image input and output were native modalities of GPT-4 Omni (GPT-4o).
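
For what it’s worth, the API side does now expose a more native audio path. A hedged sketch, assuming the gpt-4o-audio-preview chat interface (the model name, field layout, and sample file are assumptions for illustration, not a claim about how the app works internally):

```python
import base64

from openai import OpenAI

client = OpenAI()

# Base64-encode a local recording ("clip.wav" is a placeholder).
with open("clip.wav", "rb") as f:
    b64_audio = base64.b64encode(f.read()).decode("utf-8")

# Send the audio straight to the model (no separate Whisper step).
completion = client.chat.completions.create(
    model="gpt-4o-audio-preview",             # assumed audio-capable model name
    modalities=["text", "audio"],
    audio={"voice": "alloy", "format": "wav"},
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What emotion do you hear in this clip?"},
            {"type": "input_audio",
             "input_audio": {"data": b64_audio, "format": "wav"}},
        ],
    }],
)

# The reply includes audio plus a text transcript of what was said.
print(completion.choices[0].message.audio.transcript)
```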

1 Like

I had a discussion with one of the mods earlier in this thread covering what led me to this conclusion, the different prompts I used to test the theory, and how my conclusion has changed. It’s a fairly long conversation, but if you’re interested it’s all above in this thread.

The default voice and feel is back on my end for user-made GPTs :rabbit::honeybee:

I guess I was wrong then.