Hi, I’m using GPT4 Vision and Omni to help analyze some movies. I’m trying to really analyze each scene in the movie and get some insights regarding it. one of the thing I try to do is using the movie’s transcript for a specific section of it, alongside some key frames to build a textual graph of who talks to who during the movie. I was wondering if GPT-4 cares about the input order of the frames I give it, because for my task, the order makes a high difference. I read online that people try to split the messages: “this is image no.1” → img → “this is image no.2” → img2 and so on… but it seems too wasteful for my case.
To me (and maybe I’m wrong) like text or audio maintains a linear order (or bi-direction encoding), so should the images do by default?
did anyone notice a difference between GPT4-V and Omni on this topic by any chance?