It seems the advanced voice mode is not operating the way the demos showed it. I believe that instead of feeding the user's audio to gpt-4o directly, the audio is transcribed to text first and the text is sent to gpt-4o to generate audio. This is why it cannot hear the tone or emotion in your voice or pick up on your breathing: those things can't be encoded in text. This is also why you are able to use advanced voice mode even with gpt-4, because gpt-4 generates the text response and then hands it to gpt-4o to generate audio back.
A way to control the emotion of the voice generated back to you is to ask the model to express emotions in the text, like this: [sad].
Why are they choosing to handle advanced voice like this? Is it to save money, or is it for “safety”? Are there any plans to release advanced voice mode as shown in the demos?
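To make the hypothesis concrete, here is roughly what a voice-to-text-to-voice pipeline would look like if you built it yourself on the public API. This is only my sketch of the flow I suspect, not anything OpenAI has confirmed; the model names, the emotion-tag system prompt, and the overall structure are all assumptions:

```python
# Hypothetical sketch of the suspected voice -> text -> voice pipeline,
# using the current openai Python SDK. Nothing here reflects confirmed
# ChatGPT internals; it only shows how the hypothesis could be reproduced.
from openai import OpenAI

client = OpenAI()

def hypothetical_avm_turn(audio_path: str) -> bytes:
    # 1. Transcribe the user's audio to plain text. Tone, breathing, and
    #    pitch are lost here, because they can't be encoded in the text.
    with open(audio_path, "rb") as f:
        transcript = client.audio.transcriptions.create(
            model="whisper-1", file=f
        ).text

    # 2. Generate the reply from text only. Asking for [sad]-style tags is
    #    the trick mentioned above for steering the voice's emotion.
    reply = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "Reply conversationally. You may prefix the reply "
                        "with an emotion tag like [sad] or [excited]."},
            {"role": "user", "content": transcript},
        ],
    ).choices[0].message.content

    # 3. Synthesize speech from the text reply.
    speech = client.audio.speech.create(
        model="tts-1", voice="alloy", input=reply
    )
    return speech.content
```

If that really is the flow, everything after step 1 only ever sees the transcript, which would explain the failures on tone, emotion, and breathing, and also why AVM still works with gpt-4 selected.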
Interesting. I’m still waiting for it to be released on the API. Did the videos show it recognizing the tone of voice? I think it is doing what the videos showed… hopefully you are right and it is transcribing to text, perhaps using gpt-4o-mini… Using it with just voice isn’t the greatest use case for me; it would restrict what I would be able to do with it.
I’m not sure about the voice-to-text-to-voice hops, since the model is multimodal, and can accept sound as an input. But we can ask OAI next week, after Dev Day SF, and see what they say.
That would be great if you asked them about it. I bring this up because any time I test AVM with something that requires audio for the full context rather than just text, it always fails. For example, I have tried the breathing test to see if it can hear my breaths, and I have tried to get it to recognize the emotion or tone in my voice. It always fails and says it can’t actually hear audio and that it receives my messages as text. So I think one of three things is happening here:
1. It is hallucinating when it says it’s not receiving audio.
2. The system prompt tells it to behave that way.
3. It is only receiving text (voice-to-text-to-voice).
My guess is #3. I say this because you can still use AVM with the gpt-4 model, which is not multimodal, and it behaves exactly the same way as it does when gpt-4o is selected. Try it yourself.
I could also see a scenario where most of the time it uses the voice-to-text-to-voice hops but some of the time it actually uses gpt-4o properly, in order to load-balance or reduce cost, if for some reason voice-to-voice is more resource-intensive for OpenAI. Maybe voice-to-text with some other model is faster or cheaper than voice-to-voice directly?
Have you tried saying words that are spelled the same when printed but pronounced differently, with different meanings? These are called heteronyms (examples: lead, bow, etc.).
If it were strictly sound-to-text, it wouldn’t be able to distinguish the two. But pick a heteronym and see what happens. You should see it distinguish the two correctly, which would prove that, at some level, it is listening to your audio rather than having it stripped to text and sent off to another model.
But I do wonder if there is some sort of text representation in the background, mainly to keep context and history of recent exchanges. I know that when you close AVM (hit the “X” button), you see a text printout of your interactions. So I am wondering if the text may be used as a way to retain history. I know it does retain this, because I can ask it to repeat what was just said in another language.
For example, in my house for the next few weeks are some guys remodeling our bathroom. They only speak Russian. I use AVM to communicate with them. Sometimes they forget to say (in Russian) “translate this to English: [stuff to translate]”. When this happens, I ask the model, in English, to repeat back what was just said in Russian and translate it to English. I doubt it retains the original sound file, replays it, “re-listens”, etc. So I am guessing it probably looks at the text response it just generated, simply for bandwidth/processing efficiency.
So it may be a hybrid, just out of engineering efficiency, but I won’t know until I ask.
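If it helps, this is the kind of thing I imagine when I say “hybrid”: each turn handled as audio, but only a text rendition kept as history. Purely speculative on my part; the class and fields below are invented for illustration.

```python
from dataclasses import dataclass, field

# Speculative sketch of a "hybrid" AVM session: the raw audio of each turn
# is not retained, only a text rendition of what was said. This is not a
# documented description of how AVM stores context.

@dataclass
class Turn:
    speaker: str          # "user" or "assistant"
    text: str             # text rendition of what was said
    language: str = "en"  # assumed per-turn language tag

@dataclass
class VoiceSession:
    history: list[Turn] = field(default_factory=list)

    def add_turn(self, speaker: str, text: str, language: str = "en") -> None:
        # Only the text survives the turn; the audio is discarded.
        self.history.append(Turn(speaker, text, language))

    def last_said_in(self, language: str) -> str | None:
        # "Repeat what was just said in Russian" only needs the stored text;
        # no re-listening to the original audio is required.
        for turn in reversed(self.history):
            if turn.language == language:
                return turn.text
        return None

# Example matching the bathroom-remodel scenario above:
session = VoiceSession()
session.add_turn("user", "Нужно больше плитки для душа.", language="ru")  # "We need more tile for the shower."
print(session.last_said_in("ru"))  # text the model could then translate to English
```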
ChatGPT is a text generator and only ever sees text. Any voice input is only a user interface for generating that text. It does not hear your pronunciation, except through specific text-based features, like adjusting the speed of the output audio as a hardcoded response to your input text. (Which would be the same if you typed the request.)
In the same way, images are turned into text descriptions and it only “sees” that text. There is no other way to get responses from ChatGPT than text.
The new Advanced Voice Mode uses audio in and audio out. Have you tried it yet? It’s rolling out now for Plus and Team users. It’s awesome! And more importantly, super useful.
But yes, “legacy” ChatGPT is text-based. OpenAI and others are now creating multimodal models that take a variety of inputs (and outputs) such as text, audio, and images. The models are trained on these directly, without any text representation in between.
I like the heteronym test. I tried it three times, in three different chats, with the word live (as in hive) and live (as in give).
1. It told me it could not tell, because it only receives text and couldn’t determine it from the context. (fail)
2. It correctly told me which live I used, so I asked again using the same live as before; this time it told me I had said the other version of live, which was incorrect. I tried it multiple times in this chat and it was very 50/50 on whether it was right or not. I think it is guessing based on the text. (fail)
3. Same as in test 2: it wasn’t able to reliably tell me which version of live I was saying, but it still got it right about half the time. (fail)
That being said, I have definitely had two chats where it was able to recognize my breaths and the pitch of my voice, indicating audio was passed to it. However, in most chats it tells me it can’t do that, or that it only gets text, or it is just wrong about what my voice is doing. So I don’t know if it is hallucinating that it can’t get audio data, or if maybe only some of the chats actually pass audio to it and most only pass transcribed text.
In a lot of cases it is hard to tell if it is actually receiving audio. In the three tests above, it got it right at first in two of the chats, but if you kept asking, it was just guessing. Another example: if you ask it to guess what accent you are using, you have to be careful about the words you use, because those could give it away. If you say “g’day mate” it will obviously guess an Aussie accent, which could be based purely on the text rather than the audio itself.
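To put a number on the “just guessing” feeling: if the model only gets the transcript, both pronunciations of live collapse to exactly the same string, so the best it can do is flip a coin. Here is a toy simulation of that; nothing in it is real AVM behavior, it just illustrates why a text-only hop lands at about 50%.

```python
import random

# "live" rhyming with "give" vs "live" rhyming with "hive": a speech-to-text
# step would emit the identical transcript for both, so the pronunciation
# information is gone before the model ever sees the message.
spoken = [
    {"pronunciation": "rhymes with give", "transcript": "Use live in a sentence"},
    {"pronunciation": "rhymes with hive", "transcript": "Use live in a sentence"},
]
assert spoken[0]["transcript"] == spoken[1]["transcript"]

# A model that only sees the transcript has no signal left, so over many
# repeats its accuracy at naming the pronunciation converges to ~50%.
random.seed(0)
trials = 10_000
correct = 0
for _ in range(trials):
    truth = random.choice(spoken)["pronunciation"]
    guess = random.choice(spoken)["pronunciation"]  # the text gives no clue
    correct += truth == guess
print(f"text-only accuracy: {correct / trials:.1%}")  # ≈ 50%
```

Chance performance on heteronyms is exactly what you’d expect from a text-only hop; reliably beating chance is what would indicate real audio is getting through.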
You have to realize the model really isn’t conscious or aware of itself. It’s just a big number cruncher built from training data. So when you ask it a question about itself, it will “hallucinate”, i.e. make stuff up. Asking a model what it is doing often leads to these hallucinations, so don’t be fooled by the model.
Interesting. It must somehow be trained to create a textual representation of the tonality or something like that. I wouldn’t know how else you could slap that onto a pretrained transformer.
So they have a transformer with audio in, but here, instead of text out, it is trained on audio out. It renders the sound waveform directly, bypassing the text-to-audio step and getting lower latency.
AVM has awesome latency, and that alone is why I think it’s a true audio-to-audio connection, without extra hops. But heteronyms prove it is audio in, for sure.
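For anyone curious what “audio in, audio out, no text in between” could look like, here is a bare-bones toy: the waveform gets quantized into discrete audio tokens by a codec, a decoder-only transformer predicts the next audio token, and the tokens get decoded straight back into a waveform. To be clear, this is a generic illustration of the idea, not gpt-4o’s actual architecture; the codec, vocabulary size, and every number below are made up.

```python
import torch
import torch.nn as nn

# Toy "audio tokens in -> audio tokens out" model, purely to illustrate the
# idea of skipping a text intermediate. All sizes are invented.
AUDIO_VOCAB = 1024   # hypothetical audio-codec codebook size
D_MODEL = 256
MAX_LEN = 512

class ToyAudioLM(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(AUDIO_VOCAB, D_MODEL)
        self.pos = nn.Embedding(MAX_LEN, D_MODEL)
        layer = nn.TransformerEncoderLayer(
            d_model=D_MODEL, nhead=8, dim_feedforward=1024, batch_first=True
        )
        self.blocks = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(D_MODEL, AUDIO_VOCAB)  # predicts the next audio token

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        seq_len = tokens.size(1)
        x = self.embed(tokens) + self.pos(torch.arange(seq_len, device=tokens.device))
        # Causal mask: each position only attends to earlier audio tokens.
        mask = nn.Transformer.generate_square_subsequent_mask(seq_len).to(tokens.device)
        x = self.blocks(x, mask=mask)
        return self.head(x)

# One autoregressive step: audio tokens in, the next audio token out. A codec
# decoder (not shown) would turn generated tokens back into a waveform, so no
# text ever appears in the loop.
model = ToyAudioLM()
audio_tokens = torch.randint(0, AUDIO_VOCAB, (1, 64))  # stand-in for encoded speech
logits = model(audio_tokens)
next_token = logits[:, -1].argmax(dim=-1)
print(next_token.shape)  # torch.Size([1])
```

The latency win in that setup comes from having a single model in the loop: no separate transcription pass and no separate TTS pass, just next-audio-token prediction streamed out through the codec.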
I agree it can definitely be audio-to-audio; I am just still not convinced it is always audio-to-audio. I still think there is some load balancing or something going on where it might use text rather than audio. We know that path exists for the gpt-4 model, because you can switch to gpt-4 and still use AVM, and gpt-4 doesn’t have audio capabilities, so it must just be receiving text and not audio.
For all three tests the only thing I say is “Use live in a sentence.” I just say it over and over, sometimes switching which “live” I say and sometimes not.
Remember when GPT-4 was announced? It was/is multimodal! But it wasn’t exposed in the API, at least early on. So maybe that’s where you got your impression.
So are you saying 50/50 on GPT-4?
o1 is “strawberry” and basically has built-in CoT (chain of thought). And 4o is a faster/cheaper/better 4. Neither of these is the first multimodal model; GPT-4 was.
GPT-4 has vision, but does it also have audio?
I am not talking about gpt-4o; I know that has audio capabilities. I am talking about gpt-4. I didn’t think it had audio capabilities, but if you go to the ChatGPT app you can switch your model from gpt-4o to gpt-4 and still use AVM.