The dream isn’t far-off sci-fi anymore — it’s just unfinished business.
All the core components already exist:
ChatGPT can generate vivid, emotional text for novel-style scenes:
She leans in, hands on her knees, dress pulled tight. Her blind eyes searching, her voice calm but firm:
“What are you waiting for?”
Sora can animate that as video: body language, facial expression, even lip sync.
Voice models add matching speech – optional sugar on top.
But here’s the blocker:
Image & video AIs still generate with too much randomness. Characters don’t stay consistent across scenes.
The solution? Already proven:
Many image models already offer a reference-image mode: you supply an image of your character, and a strength slider controls how closely new outputs stick to it.
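As a rough illustration of what "reference image plus strength slider" looks like in practice, here is a minimal sketch using the open-source diffusers img2img pipeline. The checkpoint name and strength value are just placeholders, and this says nothing about how DALL·E or Sora work internally:

```python
# Minimal sketch of reference-image conditioning with a strength slider,
# using the open-source diffusers img2img pipeline.
# Assumptions: a local GPU and the "runwayml/stable-diffusion-v1-5" checkpoint.
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

reference = Image.open("avatar_reference.png").convert("RGB")

# strength near 0.0 = stay close to the reference; near 1.0 = mostly ignore it.
result = pipe(
    prompt="the same woman, now standing by a window, soft morning light",
    image=reference,
    strength=0.35,
    guidance_scale=7.5,
).images[0]

result.save("avatar_new_scene.png")
```

The slider is the whole point: the user decides how much the new scene may drift from the stored look of their character.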
Here’s the missing feature:
We need a “Persistent Avatar Mode” in DALL·E:
- Users describe their character
- DALL·E outputs 4 reference angles
- These are stored in the user profile
- Sora uses them as anchors for all video generations
That’s how you keep visual continuity.
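Since none of this exists as a product feature yet, here is a rough sketch of what the flow could look like with today's OpenAI Python SDK. The four-angle generation uses the real images.generate endpoint; the Sora anchoring step at the end is purely hypothetical and only marks where the missing piece would plug in:

```python
# Sketch of a "Persistent Avatar Mode" flow. The DALL-E calls use the real
# OpenAI Python SDK (images.generate); the Sora step is hypothetical.
import base64
import json
from pathlib import Path
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

CHARACTER = "a calm woman in her 30s, short dark hair, plain linen dress"
ANGLES = ["front view", "left profile", "right profile", "three-quarter view"]

profile_dir = Path("user_profile/avatar")
profile_dir.mkdir(parents=True, exist_ok=True)

references = []
for angle in ANGLES:
    # DALL-E 3 returns one image per call, so we loop over the four angles.
    resp = client.images.generate(
        model="dall-e-3",
        prompt=f"Studio portrait, {angle}, neutral background: {CHARACTER}",
        size="1024x1024",
        response_format="b64_json",
        n=1,
    )
    path = profile_dir / f"{angle.replace(' ', '_')}.png"
    path.write_bytes(base64.b64decode(resp.data[0].b64_json))
    references.append(str(path))

# Store the reference set in the user profile so later sessions can reuse it.
(profile_dir / "profile.json").write_text(json.dumps({"references": references}))

# Hypothetical: Sora would take these images as identity anchors for every
# video it generates. No such public parameter exists today; that is the gap.
# video = sora.generate(prompt="she greets the user", identity_anchors=references)
```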
And here’s a trick to boost realism while keeping performance low:
After the avatar speaks, the video loops a short idle animation – breathing, blinking, slight posture shifts – until the user responds.
It saves compute. And it keeps immersion alive.
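A minimal sketch of that idle loop on the client side, assuming the idle clips are short pre-rendered files rather than live generations; play_clip is a hypothetical stand-in for whatever video player the app uses:

```python
# Client-side idle loop: after the avatar finishes a reply, cycle cheap
# pre-rendered idle clips instead of generating new video, until the user
# responds. play_clip() is a hypothetical stand-in for the app's player.
import itertools
import threading

IDLE_CLIPS = ["idle_breathing.mp4", "idle_blink.mp4", "idle_posture_shift.mp4"]

def play_clip(path: str) -> None:
    # Placeholder: hand the file to the real video player here.
    # In a real app this call would block for the clip's duration.
    print(f"playing {path}")

def wait_for_user(got_reply: threading.Event, inbox: list[str]) -> None:
    inbox.append(input("> "))  # blocks until the user types something
    got_reply.set()

def idle_until_reply() -> str:
    got_reply = threading.Event()
    inbox: list[str] = []
    threading.Thread(target=wait_for_user, args=(got_reply, inbox), daemon=True).start()

    # Loop the idle animations; no new video is generated during this phase.
    for clip in itertools.cycle(IDLE_CLIPS):
        if got_reply.is_set():
            break
        play_clip(clip)

    return inbox[0]

if __name__ == "__main__":
    user_text = idle_until_reply()
    print(f"user said: {user_text}")
```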
That’s it. Nothing’s missing except one bold decision to connect the tools already built.
Personal AI avatars aren’t a moonshot.
They’re waiting to be shipped.
Example face made with Sora:
https://sora.chatgpt.com/g/gen_01jw3gptybfpss2m5bx3e383p4
Would you subscribe to ChatGPT Plus if it included your own visual AI avatar?