Images input order with gpt-4 vision/omni

Hi, I’m using GPT4 Vision and Omni to help analyze some movies. I’m trying to really analyze each scene in the movie and get some insights regarding it. one of the thing I try to do is using the movie’s transcript for a specific section of it, alongside some key frames to build a textual graph of who talks to who during the movie. I was wondering if GPT-4 cares about the input order of the frames I give it, because for my task, the order makes a high difference. I read online that people try to split the messages: “this is image no.1” → img → “this is image no.2” → img2 and so on… but it seems too wasteful for my case.

To me (and maybe I’m wrong) like text or audio maintains a linear order (or bi-direction encoding), so should the images do by default?
did anyone notice a difference between GPT4-V and Omni on this topic by any chance?

2 Likes