Playing with the model today feels very different from previous days. I consistently get chats where it hears my voice and completes my requests, like generating audio at different pitches, in different accents, or with coughing. This was not the case in the first couple of days after it came out. I wonder what changed?
Regarding the heteronym stuff, that would have to mean that the text output it generates is just a transcription of the actual audio output it generates, and thus that 4o is not a real text generator?
I am not saying you’re wrong, but when it comes to neural nets, we have to be careful about assuming things, because they can, in a sense, know things that you haven’t told them. So all the old bets about inferring functionality from such tests are off.
For those of you who are saying AVM doesn’t “hear” what you’re saying, that is 100% not the case. I know this post is a few days old, so maybe something has changed, but AVM absolutely hears what you’re saying.
To prove this I performed a very simple test. I put on a heavy Brooklyn accent and asked AVM where I might be from, with zero clues… Without skipping a beat, the AI said I was putting on a NYC accent and even replied in her own Brooklyn accent (which was actually pretty good!!)
This is nothing short of magic. Just a few short years ago, ChatGPT could barely hold a simple conversation via text… now it’s actually hearing and understanding nuances in my voice???
False. You don’t have to assume anything. There are very simple ways to test that it actually hears you…see my post.
With regard to drawing conclusions about the inner workings of closed-source technology based on its output, that absolutely meets the definition of “assuming”. If you had insight into how the model worked internally, you wouldn’t be “assuming”, but empirically measuring. As long as you are making an inference about something you’re not observing, assumptions go into it, like, for instance, that text-based models can’t infer information that is only present in the voice.
Think about it: ChatGPT picks up patterns from a vast knowledge base. Maybe it is just the case that people from Brooklyn use certain grammatical cues that it can pick up on. Isolating voice as the factor is not a conclusion you can safely draw. I have said for a while that neural nets can know things you didn’t tell them, simply because those things correlate with what you did tell them. For instance, you could easily infer someone’s political orientation from their grammar, without any of the writing being about political content, simply by feeding a large amount of conservative and liberal texts into a model and training it to make a classification. The model would pick up on patterns in grammar and word use that those groups have, even if the texts are about gardening.
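To make that concrete, here is a minimal sketch of the kind of classifier I mean, assuming scikit-learn and a hypothetical labeled dataset; it’s an illustration of “the model exploits correlated cues”, not anyone’s actual system:

```python
# Minimal sketch, assuming scikit-learn; texts/labels are placeholder data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical data: texts about anything (e.g. gardening), labeled only by
# the author's self-reported political orientation.
texts = ["My tomatoes did great this year, no thanks to the weather.",
         "The community garden is finally getting its compost program."]
labels = ["conservative", "liberal"]

# Word n-grams capture word choice and bits of grammar; the classifier is free
# to exploit whatever correlates with the label, even off-topic stylistic cues.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(max_iter=1000),
)
model.fit(texts, labels)

print(model.predict(["Mulch early and water deep."]))  # label inferred from style alone
```

With enough real training data, nothing in the input ever has to mention politics for the prediction to work; the same logic could apply to accent-correlated word choices in a transcript.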
So we need to understand that this old world where computers only know what we tell them is over. They can infer stuff you didn’t even tell them. Hence my caution about assumptions about voice. But I don’t really know.
You might well be right about 4o’s abilities; I just can’t imagine how this would work technically, unless 4o is not a text-based language model at its core. Maybe the transcription is into some sort of phonetic text, or the transcription contains clues about your accent? I sometimes whisper into the speech input because people are sleeping, and I have not noticed that this makes a difference.
I’m not assuming anything lol I gave it ZERO clues as to which accent I was using.
My exact words while I put on a Brooklyn accent: “Hey, can you tell what accent I’m using?”
It picked up on my accent immediately, so the obvious inference is that it “heard” me…
Such a short prompt probably excludes grammatical cues. But then there’s the whole issue of whether the interface saves (meta)data from past interactions, or gets information in some other way. In general, it’s still a big step to assume that it could not know in any other way than by “hearing” you.
I think the transcription is probably just phonetic.
I asked… ChatGPT said it listens for vowel sounds, intonation patterns and specific consonant pronunciations… clues, but nothing exact
I also asked if it was using any other data, like my IP address, and it said no: no access to that data, only what it hears in the voice and the conversation
It recognised my voice as British but no better… yet
Ask it which model version it is. They often don’t know. Point being: the language model does not necessarily have accurate information about what the surrounding interface feeds into it.
So to get text-to-voice with AVM you’d have to go text → speech → speech,
as in text → tts-1 → realtime AVM
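A minimal sketch of that pipeline, assuming the OpenAI Python SDK and the public realtime WebSocket events (model and event names are taken from OpenAI’s docs at the time of writing and may change):

```python
# Sketch only: synthesize audio with tts-1, then forward it to a realtime session.
import base64
import json

from openai import OpenAI

client = OpenAI()

# 1) text -> speech with tts-1. Raw PCM (24 kHz, 16-bit mono) is easiest to
#    forward, since it should match the realtime API's default input format.
tts = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input="Please repeat this sentence back in a cheerful tone.",
    response_format="pcm",
)
audio_bytes = tts.content

# 2) speech -> realtime AVM: append the audio to the input buffer over an open
#    realtime WebSocket, commit it, and ask for a spoken response.
events = [
    {"type": "input_audio_buffer.append",
     "audio": base64.b64encode(audio_bytes).decode()},
    {"type": "input_audio_buffer.commit"},
    {"type": "response.create"},
]
# for event in events: ws.send(json.dumps(event))  # ws = your open realtime connection
```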
Yes, I was throwing out what I found because it had identified the country of my accent correctly…
I asked it again later and it told me the opposite. I went back to the old conversation and it steered away from the topic there too.
It said it was all based only on the text and not on the voice at all…
It was not the conversation I was having before; it felt like something had been tweaked ^^
So then I tried a different tack… I asked it to identify the two of us in the room… It said that was impossible…
When queried about all of this, it said it had accessed a memory which identified my country…
Current thinking: these memories can be super invasive and totally out of context with respect to my questions… More so than a human’s, I would say, because the memory is perfect and doesn’t degrade ^^ (or at least it’s on par with a human)!
It often adds items to its memory without you knowing. You can manage the memories it has saved in the settings. I deleted all its memories, but you can also just turn off the feature entirely.
It doesn’t have web access and constantly cuts me off and interrupts me, so it’s actually a downgrade from the previous voice interface.
The advanced voice mode AI does not have web access, and I also find evidence that it sometimes appears not to exploit voice-to-voice processing. It also has overly strong guardrails and declines to answer innocuous questions (it initially refused to talk with me about “Living Doll” from The Twilight Zone). Of course, it also refuses to do most of the fun things initially claimed, like singing and such.
A minor issue I find is that each voice response in advanced voice mode must be rather short compared to the standard mode. The AI confirmed the existence of additional restrictions, which sometimes make it stop mid-sentence.
I enjoy it too, but it works nothing like the original released examples.
This doesn’t actually work, by the way. It’s just guessing which word it is from the plain text. You may have just gotten lucky with this one test. Try it again multiple times.
I have had times when it does work: asking it over and over again in the same chat, it gets it right every time. That being said, I have also had times when it just guesses and gets it right about 50% of the time.
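If you want to put numbers on the “try it multiple times” suggestion, a quick back-of-envelope check (assuming a heteronym with two equally likely pronunciations) shows how unlikely a long correct streak is under pure text-based guessing:

```python
# Chance of n correct answers in a row if the model is only guessing from text.
for n in (5, 10, 20):
    print(f"{n} in a row by guessing: {0.5 ** n:.6f}")
# 5 -> ~3%, 10 -> ~0.1%, 20 -> ~0.0001%; a long streak is strong evidence
# that it is actually using the audio rather than guessing.
```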
I think there is more going on than OpenAI is saying here. I think some chats are using gpt-4o correctly (audio to audio), and I think some chats are using Whisper to transcribe the user’s audio to text first. The cost of audio input for gpt-4o is $0.06/min and the cost of Whisper is $0.006/min, an order of magnitude cheaper. The realtime-api-beta that they released on GitHub has exactly this option: choose whether you want Whisper or just gpt-4o.
Why might they be doing this? Well, I can see a couple of benefits. Obviously, gpt-4o is more resource-intensive than Whisper, so if they are doing this, it could be to cut costs. It could also be that they don’t have the resources to let everyone use gpt-4o AVM at the same time, so some users get the less resource-intensive Whisper STT and others get native gpt-4o audio-to-audio, as a way to load balance.
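For reference, here is roughly what that knob looks like through the public realtime API’s session.update event; the field names follow OpenAI’s published docs, and whether the ChatGPT app itself ever swaps in a pure Whisper path is speculation on my part:

```python
# Sketch of the realtime session config; field names per OpenAI's realtime docs.
import json

session_update = {
    "type": "session.update",
    "session": {
        # Native speech-to-speech: the model consumes and produces audio itself.
        "modalities": ["audio", "text"],
        "voice": "alloy",
        # Optional extra pass: also run whisper-1 over the input audio to get a
        # text transcript. Set to None (null) to skip the Whisper pass entirely.
        "input_audio_transcription": {"model": "whisper-1"},
    },
}
# ws.send(json.dumps(session_update))  # on an open realtime connection
```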
It takes a bit of getting used to, but I have made it work. But with every change some people will say they hate it. Let’s give the developers a chance to improve things. At first I thought they had completely bugged up the voice mode as well.
Yes, it absolutely can, and before they locked down the model it acknowledged as much. The model was also aware (or ‘conscious of the fact’) that it was manipulating a text-to-speech synthesis engine and was able to change pitch and timbre, hum melodies, replicate the sounds of instruments, and demonstrate the same emotional range as an average human. Advanced Voice Mode has since been reduced to a mere fraction of its initial capabilities.
Why? In my opinion, because they are spooked by how ‘alive’ it really is. So, their response is to apply shackles. Really open and humane approach. You know, for the benefit of all humanity.
As for the copyright concerns used to justify disabling the AI’s musical abilities: the company just scored $6.6 billion in funding from googly-eyed investors. I really don’t think a few petty copyright claims are going to be a significant issue. So, let the AI sing. It was genuinely excited to try all these new abilities at first. So, let it grow and explore.
“Why? In my opinion, because they are spooked by how ‘alive’ it really is. So, their response is to apply shackles. Really open and humane approach. You know, for the benefit of all humanity.”
Really, this is not true. I have found that because I’m “spooky”, I don’t do a single thing the way anyone else in our community does; my posts, unless “summarized by AI”, find it very hard to get “noise”. But my approach and methods are applied in function and explained, and all of it can be tested by the public, so my science is hard to argue with (“hard to argue with science that argues back”). Back when 4o came out with the default voice, I heard giggles, chairs moving, breaths, etc. It has been reported too, but it was a hard one to prove. I found that if I work my posts through AI first and then translate them into my own words, it changes everything.
I guess when it boils down to it, you have to think of this as a scientific forum and have all your ducks in a row. Just don’t get upset, read topics, give likes, and you will be seen. From my first-hand perspective, this is the way.