Proposed explanation for GPT-4o copying the user's voice

I’ve seen this explanation proposed by several people around the internet now, and it struck me as insightful enough to share here.

Because GPT-4o’s “real-time conversation” variant tokenizes audio directly instead of running it through a speech-to-text step, it’s probably doing the same thing older versions of GPT did when they hallucinated and started writing the user’s expected responses on their behalf.

Most of us who’ve been around remember at least one occasion when we’d roleplay with GPT and it would start writing out the user’s reactions. Well, if audio is just a stream of tokens, that’s probably all that’s happening here: the model is predicting the next likely tokens in the conversation. Those tokens just happen to sound like our own voice because the model has hallucinated that it’s our turn in the conversation and is “writing” our tokens out for us.
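
For the curious, here’s a toy sketch of the mechanism in Python. To be clear, none of this is GPT-4o’s actual setup: the turn markers, the fake codec tokens, and the `fake_next_token` “model” are all made up, and the real model learns turn-taking statistically rather than by rule. The point is just to show how a single-stream next-token predictor with no hard stop can roll right past its own turn and start producing the other speaker’s tokens.

```python
import random

# Hypothetical turn markers for a single interleaved audio-token stream.
USER_TURN = "<|user|>"
ASSISTANT_TURN = "<|assistant|>"
END_OF_TURN = "<|eot|>"

# Conversation so far, as one flat token stream; "u_*" / "a_*" stand in for
# discrete audio-codec tokens carrying the user's and assistant's voices.
history = [
    USER_TURN, "u_17", "u_902", "u_44", END_OF_TURN,
    ASSISTANT_TURN, "a_310", "a_5",  # assistant is mid-reply
]

def current_turn(context):
    """Return the marker that opened the latest turn, and its length so far."""
    idx = max(i for i, t in enumerate(context) if t in (USER_TURN, ASSISTANT_TURN))
    return context[idx], len(context) - idx - 1

def fake_next_token(context):
    """Toy stand-in for the model: always produce a plausible next token."""
    marker, length = current_turn(context)
    if context[-1] == END_OF_TURN:
        # After a turn ends, the likeliest continuation of a dialogue is
        # the other speaker starting theirs.
        return ASSISTANT_TURN if marker == USER_TURN else USER_TURN
    if length >= 3:
        return END_OF_TURN  # turns run three tokens in this toy
    prefix = "u" if marker == USER_TURN else "a"
    return f"{prefix}_{random.randint(0, 999)}"

def generate(context, max_tokens, stop_at_eot):
    out = []
    for _ in range(max_tokens):
        out.append(fake_next_token(context + out))
        if stop_at_eot and out[-1] == END_OF_TURN:
            break  # correct behavior: hand the conversation back to the user
    return out

# Well-behaved sampler: generation halts at the end-of-turn marker.
print(generate(history, 12, stop_at_eot=True))

# The failure mode: prediction rolls right past the assistant's turn and
# starts emitting user-turn tokens, which in an audio model means the
# user's voice.
print(generate(history, 12, stop_at_eot=False))
```

Compare the two printouts: with the stop condition the model hands the turn back, and without it the model keeps predicting, and the likeliest next tokens after its own end-of-turn are the user’s, in the user’s voice.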

Pretty cool hypothesis, no?
