I’ve noticed a consistent shift in accent and resonance when Sora generates my voice in different contexts. My natural accent is from the Chicago suburbs (Inland North), but in some generated videos—particularly ones labeled with “gamer,” “hockey,” or “casual” tones—the voice becomes noticeably more nasal and shifts toward a North Central / Wisconsin–Minnesota dialect.
It seems like Sora is adapting prosody and resonance to match the persona of the prompt (e.g., friendly gamer, upbeat locker-room energy), which is interesting, but it sometimes overrides the authentic voice identity. In contrast, prompts like “military” or “gym” preserve my natural tone and accent much more faithfully.
I’d love to see Sora maintain a stronger separation between:
- Voice identity: core accent, vowel shaping, and resonance unique to the speaker
- Contextual tone: emotional or situational expressiveness
Preserving that distinction would help keep cloned voices regionally authentic while still letting the model express tone and energy appropriate to the scene.
(For what it’s worth, this isn’t a complaint — I actually find it fascinating, and in some cases the accent drift even sounds charming. I’m offering this in the spirit of constructive feedback to help refine accent and prosody control.)
(And yes, I first asked ChatGPT why all my hockey and gamer cameos sounded like I was from Wisconsin or Minnesota, and then felt validated that I was actually picking up on something weird but real.)
Finally, I'm not sure if this is the proper way to give model feedback, but I couldn't find anywhere else to email or post it. I often notice other specific things about the model, so it would be nice to have an easier way to provide detailed user feedback.