Reading videos with GPT4V

Thanks @moonlockwood and @kjordan!

The update


Where it falls short

Let’s take the following clip: Reddit - Dive into anything (9 seconds)

I’ve laughed way too hard at this clip, but the AI can’t comprehend what’s happening in it, even though it’s clear as day that the air conditioner just vanishes from one frame to the next.

Final transcription:

In the short clip, it appears to depict a man interacting with an air conditioning unit in a room. Initially, he stands near the unit, looking intrigued or slightly confused. He reaches towards and possibly fiddles with the machine, as if trying to activate or adjust it. Shortly afterward, he steps back, presumably expecting the air conditioner to start functioning. However, his expression quickly turns to one of shock or surprise, and we observe him starting to back away from the unit, seemingly in response to an unexpected occurrence, like a loud noise or a forceful blast of air.

As the man becomes more flustered, he retreats farther from the air conditioner, now in a defensive posture. This suggests the unit may be malfunctioning or has produced an effect he did not anticipate. Almost immediately, he turns and swiftly moves away from the unit, reinforcing the idea that something has gone wrong. He leaves the vicinity with a look of alarm on his face, glancing back at the air conditioner with concern as if to verify the source of the problem.

Throughout this sequence, the clip shows the man’s quick transition from curiosity to surprise, and finally to a state of alarm, ending with him hastily exiting the scene. Given there is no audio provided, it is not possible to infer additional context from any sounds that might be related to the clip.

Even feeding the clip through GPT4V multiple times doesn’t change the outcome much. It still thinks the air conditioner is there and the man is somehow interacting with it.

Where Whisper confuses GPT4V

This time, let’s use a short clip that just shows a cat coming closer and closer to the camera. The audio track is silent, but it exists. Here’s where Whisper’s problem emerges.

The final transcription:

Based on the description of the short clip, we can surmise the following narrative and context:

The clip begins with a cat cautiously emerging from its hiding place, perhaps behind some furniture or a doorway, exhibiting signs of curiosity or wariness. As the clip progress, the cat’s movements forward and facial expressions evolve, suggesting a heightening of its interest or possible surprise in response to an unseen stimulus, maybe a human interaction, another animal, or an unexpected noise.

The central portion of the clip focuses on a close-up of the cat’s face, emphasizing its wide-eyed stare, which could indicate a variety of feline emotions such as surprise, intense curiosity, or even mild alarm. The cat holds its gaze steadily, likely captivated by whatever it is observing.

Towards the end of the clip, we are presented with very similar scenes where the cat remains deeply concentrated on the subject of its attention. It doesn’t display obvious signs of readiness to pounce or flee, suggesting that its interest is piqued but not to the point of invoking a strong fight or flight response.

The background audio mentions “Subtitles provided by DimaTorzok”, indicating the presence of translated text accompaniment, commonly included for viewers to understand spoken dialogue or commentary not in their native language.

Overall, the clip appears to capture a candid moment of a cat’s typical behavior when faced with something intriguing or unexpected, highlighting the animal’s capacity for focused attention and the classic feline mix of caution and curiosity.

As you can see, Whisper hallucinates a line that was never spoken:

Subtitles provided by DimaTorzok

GPT4V sometimes reasons at length about why someone suddenly says “Like and Subscribe” in a random video, although nobody says that and it is just a Whisper hallucination.

Other things that I’ve seen were:

  • Thanks for watching
  • Subtitles by …
  • Like and Subscribe (or any similar variation)
  • Subtitles provided by …

That sounds like uncurated training data to me. They likely used all sorts of transcriptions from videos on several platforms, and phrases like these from the end of those videos slipped into training. Since they sat over silent parts, they emerge on silent audio.

It’s possible to filter that out with a GPT-3.5 step. However, if someone actually says one of these phrases, GPT-3.5 would wrongly filter it out. One thing that could work to some extent is to filter out these phrases only when they stand alone, and to process the transcription only if it’s longer than X words.
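
A rough sketch of that last idea in Python, without the GPT-3.5 step. Everything here is an assumption for illustration: the phrase list only covers what I listed above, `filter_transcript` and `max_words` are made-up names, and the simple prefix match will miss many "similar variations". The transcript is dropped only when it is both short and starts with one of the known phrases, so a longer transcript that happens to contain "thanks for watching" is kept.

```python
import re

# Phrases seen on silent audio (from the list above).
# Assumption: this list is illustrative, not exhaustive.
KNOWN_HALLUCINATIONS = (
    "thanks for watching",
    "like and subscribe",
    "subtitles by",
    "subtitles provided by",
)


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def filter_transcript(text: str, max_words: int = 6) -> str:
    """Return "" for a short transcript that starts with a known phrase.

    max_words plays the role of "X" above; 6 is an arbitrary default.
    Anything longer, or not matching a known phrase, is returned unchanged.
    """
    cleaned = normalize(text)
    is_short = len(cleaned.split()) <= max_words
    looks_hallucinated = any(cleaned.startswith(p) for p in KNOWN_HALLUCINATIONS)
    if is_short and looks_hallucinated:
        return ""
    return text
```

With this, `filter_transcript("Subtitles provided by DimaTorzok")` comes back empty, while a real sentence that merely starts with "Thanks for watching" and continues past the word limit is passed through untouched. The threshold would need tuning on real clips.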
