Sora as mental images: scenario trees and compute times

OpenAI talks a lot about how there must be inherent modeling of physics inside Sora for it to produce realistic video (e.g., if a coffee mug gets put down hard, how does the liquid react?).

Aside from those with aphantasia, we all have mental images. We often have them rapidly, multiple times per second, especially when thinking about and planning for the future. Note that we are typically conscious of far fewer of them than actually occur, but there are ways to measure this.

For example, I’m about to go to a job interview. My mind is racing, thinking of every question I might be asked and preparing my answers. Or, to go more primitive, it’s me preparing for a hunt, thinking through everything that could go wrong.

A great deal of mental energy goes toward planning for the future. In some contexts we can accurately be described as prediction machines; that may be the primary benefit of a larger cortex. We are always trying to predict what comes next.

Anyway, without further digression: what Sora can do is generate mental images. Let’s say a future Sora is given a first-person feed from a GoPro camera. Here is what we have it do: generate 10,000 five-second videos, each starting from the current frame. We also use language models to brainstorm all the plausible scenarios given everything we know about the subjects. The LLM can use Bing to find their location, check earthquake alerts, and go as deep as we want for this exercise. Maybe it generates 100,000 more videos. These videos could be arranged in if/then trees to project even further out.

Every simulation here would be physically valid thanks to Sora’s inherent modeling. With enough compute, I think we could predict with very high confidence what will happen in a given scene five seconds into the future, depending on the scene. And at the same time we could reason about how to influence the tree, which action to pick, and so on.
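To make the if/then trees concrete, here is a minimal sketch in Python. `generate_clip` is a hypothetical stand-in for a Sora-style generation call (it returns a coarse text label for each five-second continuation rather than real video); the tree structure and the branch-and-recurse loop are the actual idea.

```python
import random
from dataclasses import dataclass, field

# Hypothetical stand-in for a Sora-style generation call. It returns a
# coarse text label for one five-second continuation of the scene
# (real video in/out is assumed away for this sketch).
def generate_clip(start_frame: str, seed: int) -> str:
    outcomes = ["nothing changes", "a person walks in", "an object falls"]
    return random.Random(seed).choice(outcomes)

@dataclass
class ScenarioNode:
    frame: str                       # the scene state this branch ends on
    depth: int                       # number of five-second steps from now
    children: list = field(default_factory=list)

def expand(node: ScenarioNode, branches: int, max_depth: int) -> None:
    """Build the if/then tree: sample several continuations per node,
    then recurse so each branch projects further into the future."""
    if node.depth >= max_depth:
        return
    for seed in range(branches):
        child = ScenarioNode(frame=generate_clip(node.frame, seed),
                             depth=node.depth + 1)
        node.children.append(child)
        expand(child, branches, max_depth)

def count_leaves(node: ScenarioNode) -> int:
    if not node.children:
        return 1
    return sum(count_leaves(c) for c in node.children)

root = ScenarioNode(frame="GoPro starting frame", depth=0)
expand(root, branches=3, max_depth=2)
print(count_leaves(root))  # 3 branches over 2 levels -> 9 leaf scenarios
```

The tree grows exponentially in depth, which is exactly why compute cost is the soft limit here: each extra five-second level multiplies the number of rollouts by the branching factor.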

So it seems to me we are at a point without hard technological limits, only soft ones, where the issue can be siloed off into simply compute time. How far are compute costs expected to drop over the next decade? We can all agree: many orders of magnitude.

Anyway, this is a topic (Sora + mental images) I have been thinking about and not seeing anyone really talk about. Everyone is talking about film production. But there is a reason OpenAI keeps mentioning the inherent physical simulations, and this post extrapolates a bit on what that means from my perspective.

A scientist in the Australian outback builds a device to allow his blind wife to see again, but when she dies from exhaustion due to the experiment, he turns the machine to another purpose: recording dreams. Soon the scientist, his son, and his son’s girlfriend become addicted to watching their own dreams on portable video screens.

Quite prophetic on portable technology and the addiction to it, considering the film was made in 1991.

Haha I was not expecting to discover a new movie to add to my watch list. That was nicely produced, and an interesting idea.

I ran an experiment with GPT and scenario trees. It does not account for real-world scenarios well: not nuanced, too general. I think if Sora could take the starting frame as input, it would be nuanced enough to generate realistic scenarios. There is just so much more information and fidelity in video than in text.

A person could fine-tune the model by wearing a GoPro (they are getting smaller), or maybe through security cameras, to make it even more nuanced and expert.

Then you could give any model predictive capability in the real world: just starting frame + Sora + scenario trees + some defined output, with enough compute to run it in real time.
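That pipeline can be sketched in a few lines, assuming a hypothetical `rollout` function in place of a real Sora call (the outcome labels and weights are invented for illustration): sample many continuations from the starting frame, then vote to get an outcome plus a confidence.

```python
import random
from collections import Counter

# Hypothetical stand-in for one Sora rollout: given a starting frame,
# return a coarse label for what happens in the next five seconds.
# This stub ignores the frame; a real model would condition on it.
def rollout(start_frame: str, seed: int) -> str:
    rng = random.Random(seed)
    return rng.choices(
        ["nothing happens", "door opens", "cat jumps on counter"],
        weights=[0.70, 0.20, 0.10],
    )[0]

def predict(start_frame: str, n_samples: int = 1000):
    """Starting frame + many rollouts -> most common outcome + confidence."""
    votes = Counter(rollout(start_frame, s) for s in range(n_samples))
    outcome, count = votes.most_common(1)[0]
    return outcome, count / n_samples

outcome, confidence = predict("kitchen security cam, 08:00")
print(outcome, round(confidence, 2))
```

The "defined output" is just the vote tally: the confidence number falls out of how often the rollouts agree, which is what a downstream system (a smart home, an AR headset) would act on.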

Imagine a smart home that uses Sora to create these mental images from security cams, forecasting everything that will happen in the house a few minutes into the future. Take the base model, fine-tune it on your house, and generate scenario trees with the output… Maybe it would be 99% certain projecting five seconds into the future, which would be enough to flick on light switches before you reach for them, and other little things.

One more thought experiment. Let’s say you are wearing an Apple Vision Pro. Every moment, it takes the current frame, gives it to Sora as the starting frame of a video, asks for a five-second realistic continuation, and does that 10k times, averaging the results and then overlaying the future prediction on your Apple Vision display.

So what might this catch as we walk around in AR? Say you are walking under a coconut tree. Realistically, most people won’t think twice. But an LLM that does some research, plus lots of Sora generations, would in fact produce at least one scenario where a coconut falls out of the tree. That could raise a flag, and your Apple Vision could alert you not to walk under that tree: it’s a little windy, and there are ripe coconuts up there.
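The coconut case is really rare-event detection: even a low-probability outcome shows up reliably once you sample enough rollouts. A sketch, again with a hypothetical `rollout` stand-in and made-up outcome probabilities:

```python
import random

DANGEROUS = {"coconut falls"}

# Hypothetical rollout stand-in; the outcomes and their probabilities
# are invented for illustration (a coconut falling is a rare event).
def rollout(seed: int) -> str:
    rng = random.Random(seed)
    return rng.choices(
        ["person keeps walking", "bird lands in tree", "coconut falls"],
        weights=[0.90, 0.08, 0.02],
    )[0]

def danger_flags(n_samples: int = 10_000, threshold: float = 0.005) -> dict:
    """Flag any dangerous outcome whose sampled frequency clears the
    threshold. A 2% event still appears roughly 200 times in 10k
    rollouts, so hazards most people wouldn't think twice about
    get surfaced."""
    counts: dict = {}
    for seed in range(n_samples):
        outcome = rollout(seed)
        counts[outcome] = counts.get(outcome, 0) + 1
    return {o: c / n_samples for o, c in counts.items()
            if o in DANGEROUS and c / n_samples >= threshold}

print(danger_flags())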

What’s really exciting here is how it could express this information: it could simply overlay a coconut falling from the tree in front of you in AR, visually showing you, in high fidelity and with some stated confidence, what may happen in the near future.

I can imagine hundreds of settings, like construction sites, where this could become mandatory. You don’t even need very accurate predictions; in some applications, just flagging critical dangers is enough.