OpenAI talks a lot about how Sora must be doing some inherent modeling of physics in order to render realistic video (e.g., if a coffee mug gets put down hard, how does the liquid react?).
Aside from those with aphantasia, we all have mental images. We often have them very rapidly, multiple times per second, especially when thinking about and planning for the future. Note that we are typically conscious of them far less often than we have them, but there are ways to measure this.
For example, say I’m about to go to a job interview. My mind is racing, thinking of every question I might be asked and rehearsing my answers. Or, to go more primitive, picture me preparing for a hunt, running through everything that could go wrong.
A great deal of mental energy goes toward planning for the future. In some contexts we can accurately be described as prediction machines; that may be the primary benefit of a larger cortex. We are always trying to predict what comes next.
Anyway, without more digression: what Sora can do is generate mental images. Let’s say a future Sora is given a first-person feed from a GoPro camera. Here is what we have it do. We have it generate 10,000 five-second videos, each starting from the current frame. We also use language models to brainstorm all the plausible scenarios given everything we know about the subjects. The LLM can use Bing to find their location, check earthquake alerts; it can go as deep as we want for this exercise. Maybe that yields 100,000 more videos. These videos could be arranged in if/then trees to project even further out.
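To make the if/then tree concrete, here is a minimal sketch of the rollout structure. There is no public Sora API, so `generate_clip` is a hypothetical stand-in stub; in a real system it would condition a video model on a starting frame plus one of the LLM-brainstormed scenario prompts.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    prompt: str    # scenario description from the LLM brainstorm
    clip: str      # placeholder for a generated 5-second clip
    children: list = field(default_factory=list)

def generate_clip(start_frame: str, prompt: str) -> str:
    # Stub: a real implementation would call a video model here
    # (hypothetical; no such API exists today).
    return f"clip({start_frame} | {prompt})"

def build_tree(start_frame: str, scenarios: list, depth: int) -> list:
    """Branch five-second rollouts into an if/then tree of the given depth."""
    nodes = []
    for prompt in scenarios:
        node = Node(prompt, generate_clip(start_frame, prompt))
        if depth > 1:
            # The end of each clip seeds the next round of rollouts,
            # projecting further out in five-second steps.
            node.children = build_tree(node.clip, scenarios, depth - 1)
        nodes.append(node)
    return nodes

roots = build_tree("gopro_frame_0", ["mug slips", "mug lands safely"], depth=2)
```

With 10,000 rollouts per frame and a branching tree on top, the data structure itself is trivial; the cost lives entirely in the clip generation, which is the compute question raised below.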
Every simulation here would be physically valid thanks to Sora’s inherent physics modeling. I think with enough compute we may be able to predict, with very high confidence, what will happen in a given scene five seconds into the future, depending on the scene. And at the same time we could reason about how to influence the tree: which action to pick, and so on.
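The action-picking step could look something like this sketch: for each candidate action, run a batch of rollouts and score how often the simulated future ends well. The `rollout_outcome` function below is a stub standing in for "run the video model, then classify the final frame"; the action names and probabilities are invented for illustration.

```python
import random

random.seed(0)

def rollout_outcome(action: str) -> float:
    # Stub for "generate a rollout conditioned on this action and score it."
    # The 0.8 / 0.3 success rates are made-up numbers for the demo.
    p_good = 0.8 if action == "steady the mug" else 0.3
    return 1.0 if random.random() < p_good else 0.0

def pick_action(actions, n_rollouts=1000):
    """Choose the action whose simulated futures look best on average."""
    scores = {
        a: sum(rollout_outcome(a) for _ in range(n_rollouts)) / n_rollouts
        for a in actions
    }
    return max(scores, key=scores.get), scores

best, scores = pick_action(["steady the mug", "do nothing"])
```

This is just Monte Carlo rollout evaluation, the same shape of idea used in planning algorithms; the speculative part is using a physics-faithful video model as the simulator.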
So it seems to me we are at a point with no hard technological limits, only soft ones, where the issue can be siloed off into compute cost alone. And how far are compute costs expected to drop over a decade? We can all agree: many orders of magnitude.
Anyway, this is a topic (Sora + mental images) I have been thinking about and not seeing anyone really discuss. Everyone is talking about film production. But there is a reason OpenAI keeps mentioning the inherent physical simulation, and this post extrapolates a bit on what that means from my perspective.