Reading videos with GPT-4V

Regardless of your image’s size, it will tile it (unless it’s already 512x512).

If it’s high-res, it will first down-scale it to 512x512 for an overview, then it will resize the high-res image and start slicing/tiling it for inspection.

You can actually see how it will format the image in the pricing calculator.

In any case, it makes sense to scale the image yourself rather than relying on the API to do it automatically.
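
If you want the tiling to be predictable, a minimal Pillow sketch like this can pre-scale a frame along the same lines as the documented preprocessing (fit within 2048x2048, then bring the short side down to about 768px before the 512x512 tiling happens server-side). The file names are just placeholders:

from PIL import Image

MAX_SIDE = 2048    # long-side cap for high-detail images
SHORT_SIDE = 768   # short-side target before 512x512 tiling

def prescale(src: str, dst: str) -> None:
    # Downscale locally so the API's automatic resizing has nothing left to do.
    img = Image.open(src)
    img.thumbnail((MAX_SIDE, MAX_SIDE))      # fit within 2048x2048, keeps aspect ratio
    w, h = img.size
    scale = SHORT_SIDE / min(w, h)
    if scale < 1:
        img = img.resize((round(w * scale), round(h * scale)))
    img.save(dst)

prescale("frame.png", "frame_scaled.png")    # hypothetical paths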

Regarding your Whisper issue, I’d recommend using a VAD (Voice Activity Detector) alongside Whisper to mark moments of silence and slice them out.
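
A rough sketch of that idea, assuming the webrtcvad package and 16-bit mono PCM WAV input (the file name and aggressiveness level are placeholders): mark the speech regions first, then transcribe only those slices with Whisper.

import wave

import webrtcvad  # pip install webrtcvad

vad = webrtcvad.Vad(2)  # aggressiveness 0 (least) to 3 (most)

with wave.open("audio_16k_mono.wav", "rb") as wf:     # must be mono, 16-bit PCM
    sample_rate = wf.getframerate()                   # 8000/16000/32000/48000 Hz
    pcm = wf.readframes(wf.getnframes())

frame_ms = 30
frame_bytes = int(sample_rate * frame_ms / 1000) * 2  # 16-bit samples -> 2 bytes each

speech_regions, start = [], None
for i in range(0, len(pcm) - frame_bytes + 1, frame_bytes):
    t = i / 2 / sample_rate                           # current position in seconds
    if vad.is_speech(pcm[i:i + frame_bytes], sample_rate):
        if start is None:
            start = t
    elif start is not None:
        speech_regions.append((start, t))
        start = None
if start is not None:
    speech_regions.append((start, len(pcm) / 2 / sample_rate))

print(speech_regions)  # cut the audio to these regions before sending it to Whisper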

I’m enjoying watching your progress! It’s a great idea!


No wonder even GPT-4 can’t properly describe what happened in the video.

Show a series of these pictures to a random person and they too will have difficulty describing what happened in the video. There is a key piece of information missing here, and that is that the air conditioner fell out of the window. This is not clear in the pictures, whereas it is clear in the video.

The solution here might be to take frames from the film, not at equal intervals, but from one key event to the next.
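
One way to approximate “key event to key event” sampling is ffmpeg’s scene-change select filter, which only emits frames where the content changes noticeably. A sketch, assuming ffmpeg is installed; the input path and the 0.3 threshold are placeholders:

import subprocess

# Extract one JPEG per detected scene change instead of sampling at fixed intervals.
subprocess.run([
    "ffmpeg", "-i", "clip.mp4",
    "-vf", "select='gt(scene,0.3)'",  # 0.3 = scene-change threshold, tune per video
    "-vsync", "vfr",                  # keep only the selected frames
    "frames/event_%03d.jpg",
], check=True)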


I wonder whether this is really equivalent to video input, since there is no causal modeling over time if we feed in all the images in a single round.

@Fusseldieb - I found this while googling, because I had a similar idea to extract scene cuts/keyframes and also add timestamps to the images. The grid idea improves it a lot.

I work on an open-source module for the CMS Drupal that has different AI components for building full workflows. I just added this as a Video-to-Text workflow.

For anyone wanting to replicate this, you can see the FFmpeg commands at line 298 of this file:
git. drupalcode .org/project/ai_interpolator_openai/-/blob/1.0.x/src/Plugin/AiInterpolatorFieldRules/OpenAiVideoToText.php?ref_type=heads#L298
(Sorry, I’m not allowed to link yet).
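
These are not the module’s exact commands (those are in the linked file), but a rough equivalent of the same idea wrapped in Python: select scene-change frames, burn the timestamp into each one, and tile them into a single grid image. Paths, the threshold, and the 4x4 grid size are placeholders:

import subprocess

subprocess.run([
    "ffmpeg", "-i", "input.mp4",
    "-vf", (
        "select='gt(scene,0.3)',"                    # keep scene-change frames
        "drawtext=text='%{pts\\:hms}':x=10:y=10:"    # burn in the source timestamp
        "fontsize=36:fontcolor=white:box=1:boxcolor=black@0.5,"
        "scale=320:-1,"                              # shrink each frame before tiling
        "tile=4x4"                                   # 16 frames per grid image
    ),
    "-vsync", "vfr",
    "-frames:v", "1",                                # write just the first grid
    "grid.jpg",
], check=True)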

I also created a video of how to set this up using the GUI, along with an example result, if anyone wants to replicate it:


workflows-of-ai . com/workflow/improve-media-3

It requires you to know how to install Drupal though.

This is a brilliant strategy. I’m working on a mobile app that could use this feature and wondered if you could somehow help me implement the code behind your strategy in a Buildship node or a custom function within OpenAI Assistants?

Hi @Meistro ,
I would personally wait until GPT-4o is fully out, so we can take advantage of the new Audio model, which seems, at least from the presentations, worlds better than Whisper.


Are you opposed to helping me anyway? I’d like to expedite production and then upgrade the model once the code structure is in place. I’m just having trouble converting the OpenAI tutorial from Python to JS to work in Buildship. Thanks!

@Meistro
I’m not particularly familiar with Buildship, but I still want you to make it happen.

Which issues are you facing? Maybe I can help you out regardless.

So as not to pollute this thread and send everyone e-mails with our back-and-forth, let’s chat over DMs if you’re interested.
Cheers!

Regarding the air conditioner video, I tried explaining it with a total of 23 frames, including keyframes and a few surrounding frames, and it seems to have worked well.

[Image: a grid of the sequential frames extracted from the video]

The sequence of images shows a man interacting with a window air conditioning unit. Here’s a breakdown of the key frames and why this might be funny:

  1. First few frames: The man is seen standing next to a window air conditioning unit, seemingly adjusting or inspecting it.
  2. Middle frames: The air conditioning unit suddenly falls out of the window, leaving a gaping hole. The man’s expression and body language suggest surprise and confusion.
  3. Last few frames: The man looks at the empty window frame, then turns away, possibly in disbelief or frustration.

The humor in this situation likely comes from the unexpected and sudden nature of the air conditioning unit falling out of the window. The man’s reaction adds to the comedic effect, as it captures a moment of surprise and helplessness. This type of physical comedy, where something goes wrong in an unexpected way, is a common source of humor.


If anyone wants to know the specific system messages or how I did it, please let me know.


Hi, I’d like to know whether this was sent as multiple images or as a single image. Also, which model did you use, GPT-4V or GPT-4o?

Thank you for your question.
I presented this as multiple images to the model. The model used was GPT-4o.

I extracted the I-frames and the 2 frames before and after each I-frame, encoded them as base64, and used GPT-4o’s vision capabilities.
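
For anyone wanting to replicate this, here is a hedged sketch of the same approach (not the exact code used above): list each frame’s picture type with ffprobe, decode the I-frames plus their two neighbours on each side with OpenCV, base64-encode them as JPEGs, and send them in a single GPT-4o request. The file path and prompt are placeholders.

import base64
import subprocess

import cv2
from openai import OpenAI

VIDEO = "clip.mp4"  # hypothetical input path

# 1. List the picture type (I/P/B) of every frame with ffprobe.
probe = subprocess.run(
    ["ffprobe", "-v", "error", "-select_streams", "v:0",
     "-show_entries", "frame=pict_type", "-of", "csv=p=0", VIDEO],
    capture_output=True, text=True, check=True,
)
pict_types = [line.strip().strip(",") for line in probe.stdout.splitlines() if line.strip()]

# 2. Keep each I-frame plus the two frames before and after it.
wanted = set()
for idx, ptype in enumerate(pict_types):
    if ptype == "I":
        wanted.update(range(max(0, idx - 2), min(len(pict_types), idx + 3)))

# 3. Decode only those frames and base64-encode them as JPEG.
cap = cv2.VideoCapture(VIDEO)
images, idx = [], 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    if idx in wanted:
        ok, buf = cv2.imencode(".jpg", frame)
        if ok:
            images.append(base64.b64encode(buf.tobytes()).decode())
    idx += 1
cap.release()

# 4. Send the frames to GPT-4o as a single multi-image message.
client = OpenAI()
content = [{"type": "text",
            "text": "These are sequential frames from a video. Describe what happens and why it might be funny."}]
content += [{"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}} for b64 in images]
response = client.chat.completions.create(model="gpt-4o",
                                           messages=[{"role": "user", "content": content}])
print(response.choices[0].message.content)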


I’ve made a similar example that also integrates TTS to narrate the video and selects 5 frames from it. You can check out the prompts and the example I gave: huggingface[dot]/spaces/Pixeltable/AI-Video-Analyzer-GTP4-Vision-TTS-Narration

@dignity_for_all - I would love that. I can’t get 4o to understand the sequence of events here. It thinks Curry is about to take a shot. If you can share how you’re helping GPT understand the sequence and direction of events, that would be helpful!


Hey @dignity_for_all - what’s the best way to connect with you?

@temiwale88 you can try this:


import pixeltable as pxt
from pixeltable.iterators import FrameIterator
from pixeltable.functions import openai, image
from datetime import datetime
import math
import os

# Initialize Pixeltable
pxt.create_dir('video_processor')

# Create main video table
video_table = pxt.create_table(
    'video_processor.videos',
    {
        "video": pxt.Video,
        "timestamp": pxt.Timestamp,
    }
)

# Create frames view with iterator
frames_view = pxt.create_view(
    'video_processor.frames',
    video_table,
    iterator=FrameIterator.create(video=video_table.video, fps=1)
)

# Add computed column for frame encoding
frames_view['encoded_frame'] = image.b64_encode(frames_view.frame)

# Insert video
video_table.insert([{
    "video": "path/to/your/video.mp4",
    "timestamp": datetime.now(),
}])

# Get frames and select representative ones
frames = frames_view.select(frames_view.encoded_frame).collect()
frame_list = [f["encoded_frame"] for f in frames]

# Select 5 evenly spaced frames
total_frames = len(frame_list)
if total_frames > 5:
    interval = total_frames / 5
    selected_indices = [math.floor(i * interval) for i in range(5)]
    selected_frames = [frame_list[i] for i in selected_indices]
else:
    selected_frames = frame_list

# Create content for GPT-4 Vision
frame_content = [
    {
        "type": "text",
        "text": "This is a sequence of frames from a video. Please analyze the overall action and content across all frames:"
    }
]

for i, frame in enumerate(selected_frames, 1):
    frame_content.extend([
        {
            "type": "text",
            "text": f"Frame {i}:"
        },
        {
            "type": "image_url",
            "image_url": {
                "url": f"data:image/jpeg;base64,{frame}"
            }
        }
    ])

# Set up GPT-4 Vision analysis
messages = [
    {
        'role': 'system',
        'content': "Analyze this video sequence and describe the key events and content you observe."
    },
    {
        'role': 'user',
        'content': frame_content
    }
]

# Add GPT-4 Vision analysis as computed column
video_table['response'] = openai.chat_completions(
    messages=messages,
    model='gpt-4o',
    max_tokens=500
)

# Extract content
video_table['content'] = video_table.response.choices[0].message.content.astype(pxt.StringType())

# Generate audio narration as a computed column
@pxt.udf
def generate_voiceover(script: str) -> str:
    import requests
    import tempfile
    
    response = requests.post(
        "https://api.openai.com/v1/audio/speech",
        headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
        json={
            "model": "tts-1",
            "input": script,
            "voice": "onyx",
        }
    )
    
    temp_audio_path = os.path.join(tempfile.gettempdir(), "voiceover.mp3")
    with open(temp_audio_path, 'wb') as f:
        f.write(response.content)
    
    return temp_audio_path

video_table['audio_path'] = generate_voiceover(video_table.content)

# Get results
results = video_table.select(
    video_table.content,
    video_table.audio_path
).tail(1)

content = results['content'][0]
audio_path = results['audio_path'][0]

print("Content:", content)
print("\nAudio saved to:", audio_path)

@pbrunelle - Thank you! This code will be very useful down the road. For now, gpt-4o is struggling with spatial reasoning. It still thinks the Warriors are about to score in the frame I shared, when in fact they’re moving the ball down the court.


How can we use GPT-4V to understand a live stream of video and look for certain events?

Ideally, for a task like this you wouldn’t dedicate GPT-4V to the full task if it doesn’t work out of the box. Even if it did, it would probably cost less to just hire someone to do this for you.

You can struggle to find a prompt, but the reality is that the controls available are hopelessly insufficient.

Instead, the first step to focus on would be object detection for the people and the basketball. Then, once you are capable of identifying things, you can pass the results to GPT-4V for easier understanding.

But, as you’ve noted, GPT-4V and language models in general are not good at spatial reasoning.

In this case, I would focus more on the mapping/trajectories of the basketball, and which team has possession of it.
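
Here is a hedged sketch of that two-stage idea. It assumes the ultralytics package with a COCO-pretrained model (which has “person” and “sports ball” classes); the stream URL, sampling rate, and prompt are placeholders. A lightweight detector filters the sampled frames, and only the interesting ones get escalated to GPT-4o:

import base64

import cv2
from openai import OpenAI
from ultralytics import YOLO

detector = YOLO("yolov8n.pt")                      # small COCO-pretrained model
client = OpenAI()
cap = cv2.VideoCapture("rtsp://example/stream")    # hypothetical stream URL

frame_idx = 0
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    frame_idx += 1
    if frame_idx % 30:                             # sample roughly once per second at 30 fps
        continue

    result = detector(frame, verbose=False)[0]
    labels = {detector.names[int(c)] for c in result.boxes.cls}
    if "sports ball" not in labels:                # only escalate frames with the ball visible
        continue

    ok, buf = cv2.imencode(".jpg", frame)
    b64 = base64.b64encode(buf.tobytes()).decode()
    answer = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": [
            {"type": "text", "text": "Does this frame show the event we're watching for? Answer yes/no and explain."},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
        ]}],
    )
    print(frame_idx, answer.choices[0].message.content)
cap.release()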

Amazing work. Posted a link to this on my socials. (@xklob)

Thanks for this insight. It has become clear that I’d need to augment the instructions with some context about the photo before passing it to GPT-4.
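
For example (a minimal, hypothetical sketch; the context string and frame path are made up), that extra context can simply be prepended as a text part ahead of the image in the same message:

import base64

from openai import OpenAI

client = OpenAI()

with open("frame.jpg", "rb") as f:                 # hypothetical frame
    b64 = base64.b64encode(f.read()).decode()

context = (
    "Context: this frame is from the 3rd quarter of a basketball game. "
    "The Warriors are attacking the basket on the left side of the frame."
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": context},
            {"type": "text", "text": "Given that context, describe what is happening in this frame."},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)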