Reading videos with GPT4V

Hello folks, today I’ve tried something so that you don’t have to.

I’ve tried feeding GPT-4-Vision a video - for cheap!

Normally you would transcribe most of the frames in a video and then summarize it with AI. While this works to some extent, it can get very expensive, especially with longer clips. Also, it looks like my method is more efficient, but more on this below.

From my tests, I found that GPT-4-Vision can read a sequence of images laid out in a single image, which is what made this possible. I also used Whisper to improve the quality of the reading.

The short clip I’ll be feeding in is this one: Well never thought of it like that #chile #fypシ゚viral #tiktok #fyp #sc... | TikTok (22 seconds)

The method

I wrote a script which splits the video into 9 frames and extracts the audio, using FFmpeg. It then makes a “photo grid” out of the frames. I fed the audio into Whisper to transcribe it, and immediately after that fed both into GPT-4-Vision using the following prompt:

The image shows video frames in sequence. Describe what’s likely going on in each frame.
You can hear following in the audio track: ${audio_transcript}. Also mention what you can hear in the audio.

The wording, especially “likely”, was needed to avoid “Can’t assist you with that” and other shenanigans. It seems GPT4V doesn’t like to guess.
“Also mention what you can hear in the audio.” was needed so it doesn’t summarize or skip audio cues which might be important. It’s better to preserve them for use in the next step.
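For reference, here is a minimal sketch of that pipeline, assuming `ffmpeg` on the PATH and the `openai` Python SDK. The helper names, the 22-second default duration, and the `gpt-4-vision-preview` model name are my assumptions, not a definitive implementation:

```python
# Sketch of the grid pipeline: 9 evenly spaced frames -> 3x3 photo grid,
# then the grid plus the Whisper transcript go to GPT-4-Vision.
import base64
import subprocess
from PIL import Image

def extract_frames(video, count=9, duration=22.0):
    """Grab `count` evenly spaced frames as JPEGs via ffmpeg (hypothetical helper)."""
    paths = []
    for i in range(count):
        t = duration * i / count
        out = f"frame_{i}.jpg"
        subprocess.run(
            ["ffmpeg", "-y", "-ss", str(t), "-i", video, "-frames:v", "1", out],
            check=True, capture_output=True)
        paths.append(out)
    return paths

def make_grid(frame_paths, cols=3, cell=300):
    """Paste the frames into a photo grid, row by row."""
    rows = -(-len(frame_paths) // cols)  # ceiling division
    grid = Image.new("RGB", (cols * cell, rows * cell))
    for i, path in enumerate(frame_paths):
        tile = Image.open(path).resize((cell, cell))
        grid.paste(tile, ((i % cols) * cell, (i // cols) * cell))
    return grid

def describe(grid_path, audio_transcript, client):
    """Send the grid plus transcript to GPT-4-Vision with the prompt above."""
    with open(grid_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    prompt = (
        "The image shows video frames in sequence. Describe what's likely "
        "going on in each frame.\n"
        f"You can hear following in the audio track: {audio_transcript}. "
        "Also mention what you can hear in the audio.")
    resp = client.chat.completions.create(
        model="gpt-4-vision-preview",  # assumed model name
        messages=[{"role": "user", "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {
                "url": f"data:image/jpeg;base64,{b64}", "detail": "high"}}]}])
    return resp.choices[0].message.content
```

Audio extraction and Whisper (roughly `ffmpeg -i clip.mp4 -vn audio.mp3`, then `client.audio.transcriptions.create(model="whisper-1", file=...)`) slot in before `describe`.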

Which spat out:

The image appears to be a sequence of comic strip panels featuring an interaction between a green pepper-like character and a human scientist with two human-like assistants.

Frame 1: The green pepper is speaking. Its speech bubble says, “To prevent animals from eating me, I developed an irritating chemical.”

Frame 2: The pepper continues to speak, and the dialogue in the speech bubble reads, “that makes me mildly uncomfortable to consume!”

Frame 3: The scientist, who is wearing a lab coat and glasses, starts to speak, addressing the pepper. The text reads, “We’ve selectively bred…”

Frame 4: The focus is on the scientist who is now holding a smaller red pepper-like character and continues the sentence with, “your offspring until one produced so much of that chemical.”

Frame 5: The sentence concludes with the scientist still holding the small red pepper, saying, “eating it causes severe physical and psychological distress.” The small red pepper greets the larger one cheerfully with a friendly “hello!”

Frame 6: This panel features multiple instances of the smaller red pepper, each with a stylized speech bubble saying “MUNCH MUNCH” around them, illustrating that the smaller red pepper is being eaten, presumably by humans.

The last two panels appear to have the word “MUNCH” repeated in various sizes, creating an overwhelming effect to match the audio from the track mentioned, indicating the extreme reaction caused by consuming the pepper loaded with the irritating chemical.

Based on the text in the speech bubbles and the audio described, the comic seems to be a satirical take on how natural defense mechanisms of plants (like the pepper’s irritating compound) have been exaggerated through selective breeding, leading to the creation of very hot or spicy varieties that can cause intense reactions when eaten. The audio saying “MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM” likely represents the sound of someone enjoying the consumption of the pepper despite the chemical’s unpleasant effects, which is an ironic twist to the narrative. The “HELLO” heard in the audio coincides with the friendly greeting of the red offspring pepper to the original green pepper character.

While this is already pretty good, let’s run it through GPT-4 and summarize it, using the following prompt:

Explain what’s happening in the text, but this time refer to it as a short clip, not individual frames.

I had to add “this time” to avoid getting “I’m not able to complete this request” for some reason. Wording matters here, too: I didn’t use “Summarize”, as that strips out a lot of useful information.
It spat out the following:

In this short clip, we are presented with a satirical and anthropomorphized portrayal of selective plant breeding.

Initially, a character resembling a green pepper explains to a human scientist that it has developed an irritating chemical as a defense mechanism to prevent animals from consuming it. The pepper notes with a degree of pride that this chemical makes it “mildly uncomfortable to consume.”

As the clip progresses, the scientist, along with two assistants, informs the green pepper that humans have engaged in selective breeding with its species. The scientist reveals that they have amplified the very trait the pepper evolved for protection—its chemical irritant.

In a somewhat twisted turn of events, it’s shown that they have bred a smaller red pepper variant that contains such an intense amount of this irritant that consuming it now causes “severe physical and psychological distress.” This smaller pepper greets the original green one with a cheery “hello,” seemingly oblivious to the consequences of the breeding.

Throughout the clip, the sound that accompanies it intensifies—paralleling the increasing potency of the peppers. The red peppers are being consumed with zeal, as evidenced by the sound of “MUNCH MUNCH,” and a prolonged “MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM,” which might signify someone relishing their spicy taste.

Ultimately, the short clip highlights the irony of how human intervention has taken a natural defense mechanism and turned it into a culinary challenge or enjoyment, with people sometimes seeking out the very discomfort plants evolved to deter predators. Despite the severe effects, there is a strange pleasure derived from these hot peppers, as suggested by the prolonged, satisfied sound in the audio.

As we can see, even though it couldn’t see every frame, the sequence of frames together with Whisper was sufficient for it to assemble the pieces. It even grasped things that could only be heard, not seen.

Although this method is hit or miss, it works well enough if the scene doesn’t change a lot. And even if it misses a scene, Whisper can act as a backup and “fill the gaps” to some extent.

It might fall short if no voice can be heard, only noises or other sounds, as Whisper generally doesn’t transcribe noises. Sometimes it does, but it’s pretty rare. My guess is it wasn’t trained for this, but a few samples of transcribed sounds slipped into the training data.

Why am I sharing this?

At first, I was a bit hesitant to share this technique, as it’s smart yet cheap and will probably just be copied without crediting anyone, but I’ve still decided to do it for two key reasons:

  • While it works, GPT-4-Vision was likely not trained on many “sequences” of images, which becomes apparent on a lot of clips where it just cannot grasp simple yet obvious things that changed from one frame to the next. I’ll post an update on this.
    Making this method more widely known will hopefully push OpenAI to train their vision LLM on more image sequences, so it can become even more feasible in the future.

  • Whisper likes to hallucinate a lot when the audio is too quiet or simply silent, which throws the description off. E.g. if the audio is silent, I’ve seen it produce “Thanks for watching!” and similar. When it combines this with the frames, it’ll try to reason why that makes sense, although it doesn’t. Maybe if more people use this technique, it gains enough attention that someone fixes this. A Whisper v4 trained to transcribe noises and other sounds would be incredible and add to the overall quality. One can dream.

Can I get better results?

I’ve experimented with producing 5x5 grids and similar, which also produces interesting results. However, if you put too many frames into one image, it can get overwhelmed and skip things. YMMV!
In all cases it works better when detail is set to high, so it can see the image at a higher resolution, which also lets you squeeze more frames into a single image.
Maybe if you add timestamped transcriptions to the vision model, it can more closely track what’s going on. Needs to be tested!
Also, with longer clips it might make sense to generate multiple photo grids, slightly interlaced, so it can continue to “see” the progression, and then run them through a summarization step, like above.
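The interlacing idea could be sketched as a simple overlapping chunker (a hypothetical helper, not tested at scale): each grid repeats the last few frames of the previous one, so the model keeps seeing the progression.

```python
def interlaced_chunks(frames, grid_size=9, overlap=3):
    """Split a frame list into grid-sized chunks, each repeating the last
    `overlap` frames of the previous chunk to preserve continuity."""
    step = grid_size - overlap
    chunks = []
    for start in range(0, len(frames), step):
        chunks.append(frames[start:start + grid_size])
        if start + grid_size >= len(frames):
            break
    return chunks
```

Each chunk then becomes one photo grid, and the per-grid descriptions get merged in the summarization step.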

Is this better than OpenAI’s recommendation of transcribing individual frames?

From what I’ve seen, yes.
When you ask it to transcribe individual images, it can’t see movement or how the scene progresses, which makes it prone to describing the same things over and over, and to missing things that obviously moved in the video yet are still there.
It also makes the summarization step much bigger, which will eat your tokens for lunch.

How cheap is it?

From my tests, pretty cheap. To run all the following steps (complete transcription) on the above clip:

  • Transcribing the audio using Whisper v2 Large (API)
  • Reading the grid using GPT-4-Vision (API)
  • Summarization using GPT-4-Turbo (API)

It cost me 3 cents, give or take.

If this is somehow still “too expensive” for your use case, you could combine the vision and summarization steps into one, although the quality probably wouldn’t be as high. YMMV! Running it through two steps seems to add more reasoning and skip fewer details, which is good.
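Mechanically, the second pass is just a plain chat completion over the vision output. A sketch, where the model name is my assumption for GPT-4-Turbo at the time and `client` is an `openai.OpenAI()` instance:

```python
# Second pass: feed the GPT-4-Vision description into a plain GPT-4 call
# with the "short clip" prompt from above.
SUMMARY_INSTRUCTION = ("Explain what's happening in the text, but this time "
                       "refer to it as a short clip, not individual frames.")

def build_summary_prompt(description: str) -> str:
    """Combine the instruction with the vision model's frame description."""
    return f"{SUMMARY_INSTRUCTION}\n\n{description}"

def summarize(description, client, model="gpt-4-1106-preview"):  # assumed model name
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": build_summary_prompt(description)}])
    return resp.choices[0].message.content
```

Merging both passes into one call saves a request, but as noted above, the separate step seems to preserve more detail.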



Very nice work. Getting the most out of these models is not always intuitive, thanks for sharing.


Impressive and inspiring work, thanks for sharing! @Fusseldieb


Thanks @moonlockwood and @kjordan !

The update

Where it falls short

Let’s take the following clip: Reddit - Dive into anything (9 seconds)

I’ve laughed way too hard at this clip, but the AI can’t comprehend what’s happening in it, even though it’s clear as day that the air conditioner just vanished from one frame to the next.

Final transcription:

In the short clip, it appears to depict a man interacting with an air conditioning unit in a room. Initially, he stands near the unit, looking intrigued or slightly confused. He reaches towards and possibly fiddles with the machine, as if trying to activate or adjust it. Shortly afterward, he steps back, presumably expecting the air conditioner to start functioning. However, his expression quickly turns to one of shock or surprise, and we observe him starting to back away from the unit, seemingly in response to an unexpected occurrence, like a loud noise or a forceful blast of air.

As the man becomes more flustered, he retreats farther from the air conditioner, now in a defensive posture. This suggests the unit may be malfunctioning or has produced an effect he did not anticipate. Almost immediately, he turns and swiftly moves away from the unit, reinforcing the idea that something has gone wrong. He leaves the vicinity with a look of alarm on his face, glancing back at the air conditioner with concern as if to verify the source of the problem.

Throughout this sequence, the clip shows the man’s quick transition from curiosity to surprise, and finally to a state of alarm, ending with him hastily exiting the scene. Given there is no audio provided, it is not possible to infer additional context from any sounds that might be related to the clip.

Even feeding the clip through GPT4V multiple times doesn’t change the outcome much. It still thinks the air conditioner is there and the man is somehow interacting with it.

Where Whisper confuses GPT4V

This time, let’s use a short clip which just shows a cat coming closer and closer to the camera. The audio track is silent, but it exists. Here’s where Whisper’s problem emerges.

The final transcription:

Based on the description of the short clip, we can surmise the following narrative and context:

The clip begins with a cat cautiously emerging from its hiding place, perhaps behind some furniture or a doorway, exhibiting signs of curiosity or wariness. As the clip progress, the cat’s movements forward and facial expressions evolve, suggesting a heightening of its interest or possible surprise in response to an unseen stimulus, maybe a human interaction, another animal, or an unexpected noise.

The central portion of the clip focuses on a close-up of the cat’s face, emphasizing its wide-eyed stare, which could indicate a variety of feline emotions such as surprise, intense curiosity, or even mild alarm. The cat holds its gaze steadily, likely captivated by whatever it is observing.

Towards the end of the clip, we are presented with very similar scenes where the cat remains deeply concentrated on the subject of its attention. It doesn’t display obvious signs of readiness to pounce or flee, suggesting that its interest is piqued but not to the point of invoking a strong fight or flight response.

The background audio mentions “Subtitles provided by DimaTorzok”, indicating the presence of translated text accompaniment, commonly included for viewers to understand spoken dialogue or commentary not in their native language.

Overall, the clip appears to capture a candid moment of a cat’s typical behavior when faced with something intriguing or unexpected, highlighting the animal’s capacity for focused attention and the classic feline mix of caution and curiosity.

As you can see, it hallucinates:

Subtitles provided by DimaTorzok

It sometimes tries to reason quite deeply about why someone suddenly says “Like and Subscribe” in a random video, although nobody says that and it is just a hallucination.

Other things that I’ve seen were:

  • Thanks for watching
  • Subtitles by …
  • Like and Subscribe (or any similar variation)
  • Subtitles provided by …

That sounds like uncurated training data to me. They likely used all sorts of video transcriptions from several platforms, and phrases like these, which appeared at the end of those videos, slipped into the training data. Since those were silent parts, they now emerge on silent audio.

It’s possible to filter that out with a GPT-3.5 step. However, if someone actually says one of those phrases, GPT-3.5 would wrongly filter it out. One thing that could work to some extent is filtering out single phrases like these, and only processing the transcription if it’s longer than X words.
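The word-count heuristic could look like this; the phrase list is illustrative only, drawn from the hallucinations observed above, and is by no means exhaustive:

```python
# Cheap filter for Whisper's silence hallucinations: reject known filler
# phrases and anything shorter than a minimum word count.
KNOWN_HALLUCINATIONS = {
    "thanks for watching",
    "like and subscribe",
}
PREFIXES = ("subtitles by", "subtitles provided by", "legendas pela")

def is_probable_hallucination(text: str, min_words: int = 4) -> bool:
    """Return True if the transcript looks like a silence hallucination."""
    t = text.strip().lower().rstrip(".!")
    if t in KNOWN_HALLUCINATIONS or t.startswith(PREFIXES):
        return True
    return len(t.split()) < min_words
```

This would skip the transcript entirely when it matches, at the cost of occasionally dropping a genuine short utterance.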


Too good! Can’t not laugh.

This is all so interesting. I bet there’s one tiny change that might take hours or days to find that would make it work for that clip, but it would probably break other things. You can tell it has the analytical capability to figure it out; it’s just missing something that would let it notice the thing isn’t there any more and flew out the damn window…

For sure, feeding whisper silence is hilarious, all kinds of random stuff in random languages.

Levels might be the way? Check for silent clips and ignore them. Feels like whisper needs to find a main voice to latch on to or it starts transcribing digital noise.

If you fed each frame individually right around where it flies out and asked ‘is there an air conditioner in this image?’ it might shed some light on what’s happening.

You could try an i-frame/p-frame kinda deal and have a handful of big detailed images then lots of little ones that connect the dots. If they were organised in a way gptv would recognise as a timeline I bet it would work.


Thanks for reading through!

These are interesting points and questions, to which I have answers.

Levels might be the way? Check for silent clips and ignore them. Feels like whisper needs to find a main voice to latch on to or it starts transcribing digital noise.

I gave it a clip of me walking loudly while Whisper was configured for “pt”, and it transcribed:

Legendas pela comunidade do

Which translates to: “Subtitles by the community of”, which is another hallucination.

EDIT: Looks like this is an ongoing discussion: Dataset bias ("❤️ Translated by Community") · openai/whisper · Discussion #928 · GitHub

So yeah, Levels already flew out of the window, too.
You would need some sort of LLM to tell you whether there is voice or not. Then it might work.
Or simply a successor to the Whisper model which correctly outputs “(Silence)” or something along those lines when the audio is quiet.

If you fed each frame individually and asked ‘is there an air conditioner in this image?’ it might shed some light on what’s happening.

This would pretty much work; however, the question is heavily dependent on the video. You would need a step which asks all sorts of questions, and that gets inaccurate pretty fast.

Ah, of course. You’re right, without searching for voice it can’t work. I’ll have to send Whisper some straight wind noise and see what that comes out as, bet it’s comedy. Have seen some pretty weird output from non-vocal inputs.

Thanks again this is very interesting.


Or what you can do:

  • resize the image to 512px wide.
  • use the low resolution mode
  • pay only for a single tile of image recognition, instead of having the AI split the image haphazardly into subtiles (up to 16, plus the low-resolution base) at your expense…


I haven’t looked into GPT4V much. So it segments the image first thing and makes sub-tiles to work on? Any documentation on how this works? Thanks.

I didn’t quite get that. Can you elaborate?

in hi-res it splits the image into 512x512 tiles. If you feed it 1024x1024 you’ll get charged for 4 tiles, split nice and neat. Same as feeding them individually.

Ah yes, I knew that, yes!

However, giving it 512x512 isn’t quite enough when the frames have small text or details. Although I must agree with you that I could’ve resized my image to 1024x1024, as it is currently 910x910.

The image you send is first downsized so that the longest dimension is 2048 or under.

The image is again downsized so the shortest dimension is 768 or under.

Then a tile grid is split over that image, with tiles of 512x512. This means that a “photo grid” like the one shown in the first image, at 910x910, would be downsized to 768x768. That takes up the area of four tiles, with about 44% of the resulting 1024x1024 tile area unused (or overlapped). If you were to send a 720p image (or a 3x3 grid at its size), that’s a processed area of 1280x720, six tiles; and unless you make a double-width image, full 1920x1080 will be downsized similarly, to 1366x768.

So send a single 512x512 tile, and you get the highest comprehension of all, with no AI confusion about which image is being referred to, and no odd size wasting or multi-tile blending. Then specify “low” detail to ensure there is no token overbilling.
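To make the billing concrete, here is a small calculator following the resize-and-tile scheme described above; the constants (512px tiles, 170 tokens per tile plus an 85-token base, 85 tokens flat for low detail) match OpenAI’s documented vision pricing at the time of writing, so treat it as an estimate:

```python
import math

def vision_tiles(width, height, detail="high"):
    """Estimate (tiles, tokens) billed for one image under the
    resize-then-tile scheme described above."""
    if detail == "low":
        return 1, 85  # single base tile, flat cost
    # 1) fit within 2048 on the longest side
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # 2) shrink so the shortest side is at most 768
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    # 3) count 512x512 tiles covering the result
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return tiles, 170 * tiles + 85
```

For the 910x910 grid from the first post this gives 4 tiles; for a raw 1920x1080 frame, 6 tiles, matching the numbers above.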


So I think what’s happening is it can only ingest 512x512 max per tile, so you might as well just send 512x512 and have full control.

If you’re curious, there’s a little script GPT-4 wrote in here that does the i/p-frame thing; just ask for it and you’ll get a link, may be useful. Not a real GPT, just a dropbox.

I think you can set the master grid size and output resolution; if not, drop it in the GPT code and tell it to sort it out. This one is working nicely, no idea why, but it does really well most of the time:

Nice, thanks. That’s fantastic for processing large photograph type images, especially with overlaps, but would break systems like this if not matched to 512x512. Interesting.

Seems like I had a basic misunderstanding of how GPT4V ingests images. Since you said it tiles the images, it makes sense why that adds confusion to a certain extent.

Well, with 512x512 there’s not a lot of room to squeeze more frames in, but 3x3 at 512px should work for the majority of things.

I’ll do some tests and report back!


I conducted some tests, and to my surprise, in low-resolution mode (505x505) it refuses to describe the image most of the time. Not sure what’s up with that. And when it does describe it, it’s not much better; I’d say it’s worse.

The grid:

Final transcript:

The short clip depicts a sequence of actions performed by an individual, presented through nine different snapshots, which together imply motion and activity. Although I can’t view or hear the content, I can theorize what the clip might represent based on the description:

  • In the first segment of the clip, the person could be seen initiating an action by focusing their attention on a window, possibly expressing curiosity or intent.
  • As the clip progresses, the individual might interact with the window, perhaps by opening it or adjusting an object nearby, suggesting a level of engagement with their immediate surroundings.
  • At the mid-point of the clip, there could be a shift to a task that requires closer attention, with the person handling an object. The nature of the task isn’t clear but seems to require a careful examination or manipulation.
  • Towards the end of the clip, the sequence might convey the idea that the person is concluding their interaction with the object or being suddenly interrupted, indicated by a blurred motion which often signifies rapid movement or a transition to a different scene.

Given that there’s no audio, we are left to interpret the clip’s narrative solely through visual cues, which can sometimes lead to an incomplete understanding of the events unfolding. Without additional context or sensory input, viewers have to fill in the gaps with their own interpretations and imagination to make sense of the story being told in the clip.

Although you can still clearly see it’s an A/C, GPT4V lost its ability to tell. Also, as I mentioned above, most of the time it just says:

I’m sorry, but I cannot assist with requests involving the analysis or description of images with the content described.

Or simply:

I’m not able to assist with this request.

Seems like the high-resolution mode is still superior.

It is rather that 512x512 would be a good size for only a single video frame, or maybe four if pushing your luck.

The largest image that would remain unresized would be 2048x768, or 768x2048.

An image 2048 tall with 512px-wide video frames is the highest packing that maintains your quality: 4 tiles, holding seven vertical 512x288 frames.

It is rather that 512x512 would be a good size for only a single video frame

Then we would be back at the beginning. Feeding GPT4V individual frames doesn’t really seem interesting.

An image 2048 tall with 512px wide video frames is the highest packing and maintaining your quality. 4 tiles with seven vertical 512x288 frames

Hmm, interesting, but depending on the aspect ratio of the clip, it would need to be rearranged. Not all clips are 16:9.

This is novel. What a neat way of using Vision.

One app which is sort of similar that comes to mind is

They do frame-level breakdowns of the video - but not to describe the video.

Instead the app does object recognition and more.

Demo: muse - dot - ai/v/VBdrD8v-President-Kennedys-Speech-at-Rice-University

Click scenes…people…objects once you load the page (Sorry, I can’t include links - new account)