Hello folks, today I’ve tried something so that you don’t have to: feeding GPT-4-Vision a video - for cheap!
Normally you would transcribe most of the frames in a video and then summarize the result with AI. While this works to some extent, it can get very expensive, especially with longer clips. It also looks like my method is more efficient, but more on this below.
From my tests, I found that GPT-4-Vision can read a sequence of frames laid out in a single image, which is what made this possible. I also used Whisper to improve the quality of the reading.
The short clip I’ll feed in is this one: Well never thought of it like that #chile #fypシ゚viral #tiktok #fyp #sc... | TikTok (22 seconds)
The method
I wrote a script that splits the video into 9 frames and extracts the audio using FFmpeg, then assembles the frames into a “photo grid”. I fed the audio into Whisper to transcribe it, and immediately after that fed both the grid and the transcript into GPT-4-Vision using the following prompt:
The image shows video frames in sequence. Describe what’s likely going on in each frame.
You can hear following in the audio track: ${audio_transcript}. Also mention what you can hear in the audio.
The wording, and “likely” in particular, was needed to avoid “Can’t assist you with that” and other shenanigans. It seems GPT-4V doesn’t like to guess.
“Also mention what you can hear in the audio.” was needed so it doesn’t summarize or skip audio cues that might be important. It’s better to preserve them for the next step.
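For reference, here’s a minimal sketch of what this pipeline can look like in Python, assuming FFmpeg/ffprobe on the PATH, Pillow, and the OpenAI Python SDK (v1). It’s not my exact script, and the model name and parameters are placeholders, but it captures the idea:

```python
# Sketch: 9 frames -> 3x3 "photo grid" + Whisper transcript -> GPT-4-Vision.
import base64
import subprocess

from openai import OpenAI
from PIL import Image

client = OpenAI()  # uses OPENAI_API_KEY from the environment

VIDEO = "clip.mp4"
N_FRAMES = 9          # 3x3 grid
TILE = (512, 288)     # per-frame size in the grid; tweak to taste


def video_duration(path: str) -> float:
    """Ask ffprobe for the clip length in seconds."""
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-show_entries", "format=duration",
         "-of", "default=noprint_wrappers=1:nokey=1", path],
        capture_output=True, text=True, check=True,
    )
    return float(out.stdout.strip())


def extract_frames(path: str, n: int) -> list[str]:
    """Grab n evenly spaced frames as JPEGs using ffmpeg's seek."""
    duration = video_duration(path)
    frames = []
    for i in range(n):
        ts = duration * (i + 0.5) / n  # midpoint of each segment
        out = f"frame_{i}.jpg"
        subprocess.run(
            ["ffmpeg", "-y", "-ss", str(ts), "-i", path, "-frames:v", "1", out],
            capture_output=True, check=True,
        )
        frames.append(out)
    return frames


def extract_audio(path: str) -> str:
    """Strip the audio track into an mp3 for Whisper."""
    subprocess.run(["ffmpeg", "-y", "-i", path, "-vn", "audio.mp3"],
                   capture_output=True, check=True)
    return "audio.mp3"


def make_grid(frame_paths: list[str], cols: int = 3) -> str:
    """Paste the frames into a single 'photo grid' image."""
    rows = -(-len(frame_paths) // cols)  # ceiling division
    grid = Image.new("RGB", (TILE[0] * cols, TILE[1] * rows))
    for i, p in enumerate(frame_paths):
        tile = Image.open(p).resize(TILE)
        grid.paste(tile, ((i % cols) * TILE[0], (i // cols) * TILE[1]))
    grid.save("grid.jpg", quality=90)
    return "grid.jpg"


frames = extract_frames(VIDEO, N_FRAMES)
grid_path = make_grid(frames)
audio_path = extract_audio(VIDEO)

# 1) Whisper transcription of the audio track.
with open(audio_path, "rb") as f:
    audio_transcript = client.audio.transcriptions.create(
        model="whisper-1", file=f
    ).text

# 2) GPT-4-Vision reads the grid, with the transcript pasted into the prompt.
with open(grid_path, "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

prompt = (
    "The image shows video frames in sequence. Describe what's likely going on "
    "in each frame.\n"
    f"You can hear following in the audio track: {audio_transcript}. "
    "Also mention what you can hear in the audio."
)

vision_response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    max_tokens=800,
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64}",
                           "detail": "high"}},
        ],
    }],
)
frame_description = vision_response.choices[0].message.content
print(frame_description)
```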
GPT-4-Vision spat out the following:
The image appears to be a sequence of comic strip panels featuring an interaction between a green pepper-like character and a human scientist with two human-like assistants.
Frame 1: The green pepper is speaking. Its speech bubble says, “To prevent animals from eating me, I developed an irritating chemical.”
Frame 2: The pepper continues to speak, and the dialogue in the speech bubble reads, “that makes me mildly uncomfortable to consume!”
Frame 3: The scientist, who is wearing a lab coat and glasses, starts to speak, addressing the pepper. The text reads, “We’ve selectively bred…”
Frame 4: The focus is on the scientist who is now holding a smaller red pepper-like character and continues the sentence with, “your offspring until one produced so much of that chemical.”
Frame 5: The sentence concludes with the scientist still holding the small red pepper, saying, “eating it causes severe physical and psychological distress.” The small red pepper greets the larger one cheerfully with a friendly “hello!”
Frame 6: This panel features multiple instances of the smaller red pepper, each with a stylized speech bubble saying “MUNCH MUNCH” around them, illustrating that the smaller red pepper is being eaten, presumably by humans.
The last two panels appear to have the word “MUNCH” repeated in various sizes, creating an overwhelming effect to match the audio from the track mentioned, indicating the extreme reaction caused by consuming the pepper loaded with the irritating chemical.
Based on the text in the speech bubbles and the audio described, the comic seems to be a satirical take on how natural defense mechanisms of plants (like the pepper’s irritating compound) have been exaggerated through selective breeding, leading to the creation of very hot or spicy varieties that can cause intense reactions when eaten. The audio saying “MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM” likely represents the sound of someone enjoying the consumption of the pepper despite the chemical’s unpleasant effects, which is an ironic twist to the narrative. The “HELLO” heard in the audio coincides with the friendly greeting of the red offspring pepper to the original green pepper character.
While this is already pretty good, let’s run it through GPT-4 and summarize it, using the following prompt:
Explain what’s happening in the text, but this time refer to it as a short clip, not individual frames.
I had to add “this time” to avoid getting “I’m not able to complete this request” for some reason. Wording matters here, too: I didn’t use “Summarize”, as that strips out a lot of useful information.
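The second pass is just a plain chat completion. Continuing the sketch from above (the model name is again an assumption, and `frame_description` is the GPT-4-Vision output from the first pass):

```python
# Second pass: summarize GPT-4-Vision's frame-by-frame description with GPT-4 Turbo.
from openai import OpenAI

client = OpenAI()
frame_description = "..."  # output of the vision step above

summary = client.chat.completions.create(
    model="gpt-4-1106-preview",  # "GPT-4-Turbo" at the time of writing; adjust as needed
    messages=[{
        "role": "user",
        "content": (
            "Explain what's happening in the text, but this time refer to it as "
            "a short clip, not individual frames.\n\n" + frame_description
        ),
    }],
).choices[0].message.content
print(summary)
```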
GPT-4 spat out the following:
In this short clip, we are presented with a satirical and anthropomorphized portrayal of selective plant breeding.
Initially, a character resembling a green pepper explains to a human scientist that it has developed an irritating chemical as a defense mechanism to prevent animals from consuming it. The pepper notes with a degree of pride that this chemical makes it “mildly uncomfortable to consume.”
As the clip progresses, the scientist, along with two assistants, informs the green pepper that humans have engaged in selective breeding with its species. The scientist reveals that they have amplified the very trait the pepper evolved for protection—its chemical irritant.
In a somewhat twisted turn of events, it’s shown that they have bred a smaller red pepper variant that contains such an intense amount of this irritant that consuming it now causes “severe physical and psychological distress.” This smaller pepper greets the original green one with a cheery “hello,” seemingly oblivious to the consequences of the breeding.
Throughout the clip, the sound that accompanies it intensifies—paralleling the increasing potency of the peppers. The red peppers are being consumed with zeal, as evidenced by the sound of “MUNCH MUNCH,” and a prolonged “MMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMMM,” which might signify someone relishing their spicy taste.
Ultimately, the short clip highlights the irony of how human intervention has taken a natural defense mechanism and turned it into a culinary challenge or enjoyment, with people sometimes seeking out the very discomfort plants evolved to deter predators. Despite the severe effects, there is a strange pleasure derived from these hot peppers, as suggested by the prolonged, satisfied sound in the audio.
As we can see, even though it couldn’t see every frame, the sequence of frames together with Whisper was enough for it to piece everything together. It grasped things that could only be heard, not seen.
Although this method is hit-or-miss, it works well enough if the scene doesn’t change a lot. And even if it misses a scene, Whisper can act as a backup and “fill the gaps” to some extent.
It might fall short when there is no voice, only noises or other sounds, as Whisper generally doesn’t transcribe noises. Sometimes it does, but it’s pretty rare. My guess is that it wasn’t trained for this, but some samples of transcribed sounds slipped into the training data.
Why am I sharing this?
At first, I was a bit hesitant about sharing this, as it’s a smart yet cheap technique that will probably just be copied without crediting anyone, but I’ve decided to do it anyway for two key reasons:
- While it works, GPT-4-Vision was likely not trained on many “sequences” of images, which is apparent on a lot of clips where it just cannot grasp simple yet obvious things that changed from one frame to the next. I’ll post an update on this. Making this method more widely known hopefully nudges OpenAI to train their vision LLM on more image sequences, so it becomes even more feasible in the future.
- Whisper likes to hallucinate a lot when the audio is too quiet or simply silent, which throws the description off. E.g. if the audio is silent, I’ve seen it produce “Thanks for watching!” and similar. When the model combines this with the frames, it tries to reason about why that makes sense, even though it doesn’t (a cheap guard against this is sketched below). Maybe if more people use this technique, the issue gains enough attention that someone fixes it. A Whisper v4 trained on transcribing noises and other sounds would be incredible and would add to the overall quality. One can dream.
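For the silent-audio case, one cheap guard is to measure the audio level with FFmpeg’s volumedetect filter first and skip Whisper when there’s essentially nothing to hear. This is just a sketch: the threshold is arbitrary, and `audio.mp3` refers to the file from the pipeline sketch above.

```python
# Sketch: detect (near-)silent audio with ffmpeg's volumedetect filter before
# calling Whisper, so a hallucinated transcript never reaches the vision step.
import re
import subprocess


def mean_volume_db(audio_path: str) -> float:
    """Return the mean volume in dB as reported by ffmpeg's volumedetect filter."""
    out = subprocess.run(
        ["ffmpeg", "-i", audio_path, "-af", "volumedetect", "-f", "null", "-"],
        capture_output=True, text=True,
    )
    match = re.search(r"mean_volume:\s*(-?[\d.]+) dB", out.stderr)
    return float(match.group(1)) if match else float("-inf")


SILENCE_THRESHOLD_DB = -50.0  # arbitrary cutoff; tune for your material

if mean_volume_db("audio.mp3") < SILENCE_THRESHOLD_DB:
    audio_transcript = "(no speech / silent audio track)"
else:
    ...  # run Whisper as in the pipeline sketch above
```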
Can I get better results?
I’ve experimented with producing 5x5 grids and similar, which also produce interesting results. However, if you put too many frames into one image, the model can get overwhelmed and skip things. YMMV!
In all cases it works better when detail is set to high, so the model sees the image at a higher resolution, which also lets you squeeze more frames into a single grid.
Maybe if you add timestamped transcriptions to the vision model, it can more closely predict what’s going on. Needs to be tested!
Also, with longer clips it might make sense to generate multiple photo grids with slightly overlapping frames, so the model can continue to “see” the progression, and then run everything through a summarization step as above.
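A rough sketch of that chunking, assuming frames are sampled at a fixed interval (this only does the timestamp bookkeeping; grid building, vision, and summarization work the same as above):

```python
# Sketch: split a longer clip into several 9-frame grids whose frames overlap
# slightly ("interlaced" grids), so consecutive grids share some context.
def grid_timestamps(duration: float, frames_per_grid: int = 9,
                    overlap: int = 2, step_s: float = 2.0) -> list[list[float]]:
    """Return one list of frame timestamps (in seconds) per grid."""
    # Sample the whole clip at a fixed interval...
    all_ts = [i * step_s for i in range(int(duration / step_s) + 1)]
    # ...then window the samples, sliding by (frames_per_grid - overlap).
    stride = frames_per_grid - overlap
    windows = []
    for start in range(0, len(all_ts), stride):
        window = all_ts[start:start + frames_per_grid]
        if window:
            windows.append(window)
        if start + frames_per_grid >= len(all_ts):
            break
    return windows


# A 60 s clip sampled every 2 s yields grids covering 0-16 s, 14-30 s, 28-44 s, ...
for i, ts in enumerate(grid_timestamps(60.0)):
    print(f"grid {i}: {ts[0]:.0f}s-{ts[-1]:.0f}s ({len(ts)} frames)")
```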
Is this better than OpenAI’s recommendation of transcribing individual frames?
From what I’ve seen, yes.
When you ask it to transcribe individual images, it can’t see movement or how the scene progresses, which makes it more prone to describing the same things over and over and to missing things that obviously moved in the video but are still present.
It also makes the summarization step much bigger, which will eat your tokens for lunch.
How cheap is it?
From my tests, pretty cheap. Running all of the following steps (a complete transcription) on the above clip:
- Transcribing the audio using Whisper v2 Large (API)
- Reading the grid using GPT-4-Vision (API)
- Summarization using GPT-4-Turbo (API)
cost me about 3 cents, give or take.
If this is somehow still “too expensive” for your use case, you could combine the vision and summarization steps into one, although the quality probably won’t be as high. YMMV! Running it in two steps seems to add more reasoning and skip fewer details, which is good.
Cheers!