How to summarise a long video with more than 20 image frames using gpt-4o

I want to summarise a one-hour video recording. It contains a large number of image frames, but the limit per request for gpt-4o is 20 images. How can I process more than 20 images and then summarise the video?

Hi!
The general approach to video summarization is to reduce the number of frames you send to the model.
This is cost- and time-efficient because, most of the time, you won’t lose any relevant information between, say, frame 14 and frame 15 of second 3.
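As a minimal sketch of that idea (the helper name is hypothetical; in practice `total_frames` would come from the video metadata), evenly spaced sampling looks like this:

```python
# Evenly sample at most max_frames frame indices from a video.
# Illustrative helper only; total_frames would normally come from
# cv2.CAP_PROP_FRAME_COUNT on the opened capture.
def sample_indices(total_frames, max_frames):
    if total_frames <= max_frames:
        return list(range(total_frames))
    step = total_frames / max_frames
    return [int(i * step) for i in range(max_frames)]

# A 1-hour video at 25 fps has 90,000 frames; keep only 20 of them.
print(len(sample_indices(90000, 20)))  # 20
```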


A 1-hour video would be challenging due to the limit of 250 images that can be presented to the model.

I have written a script, based on the OpenAI Cookbook, that runs on Google Colab.

This Python script summarizes a video, for instance, from YouTube:

The video should be downloaded beforehand and referenced by its file path.

The video’s length is 8 minutes and 2 seconds, from which 250 frames will be extracted and presented to the model.

 !pip install -qqq opencv-python
 !pip install -qqq IPython
 !pip install -qqq openai


import cv2
import base64

cap = cv2.VideoCapture('path_to_video.mp4')
total_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
max_frame = 248
# Ceiling division so at most max_frame frames are kept, even when
# total_frames is only slightly larger than max_frame.
interval = max(1, -(-total_frames // max_frame))
base64Frames = []
frame_counter = 0
while True:
    ret, frame = cap.read()
    if not ret:
        break
    if frame_counter % interval == 0:
        _, buffer = cv2.imencode(".jpg", frame)
        base64Frames.append(base64.b64encode(buffer).decode("utf-8"))    
    frame_counter += 1
cap.release()


# Play the video back here at a speed of 3 frames per second.
import time
import base64
from IPython.display import display, Image, clear_output

display_handle = display(None, display_id=True)

for img in base64Frames:
    try:
        decoded_img = base64.b64decode(img.encode("utf-8"))
        display_handle.update(Image(data=decoded_img))
    except Exception as e:
        print(f"Error displaying image: {e}")
    time.sleep(1/3)
    clear_output(wait=True)

# Present the video frames to the model and generate a summary.
from openai import OpenAI
from google.colab import userdata
client = OpenAI(api_key=userdata.get('OPENAI_API_KEY'))
PROMPT_MESSAGES = [
    {
        "role": "user",
        "content": [
            "These are frames from a video that I want to upload. Generate a compelling description that I can upload along with the video.",
            *map(lambda x: {"image": x, "resize": 768}, base64Frames),
        ],
    },
]
params = {
    "model": "gpt-4o",
    "messages": PROMPT_MESSAGES,
    "max_tokens": 500,
}

result = client.chat.completions.create(**params)
print(result.choices[0].message.content)

I hope it can be of help, even just a little bit.


I have made slight modifications to the previous Python script to extract only I-frames so that they fit within just under 250 frames.

I have also attached it here.

For instance, the following YouTube video is 32 minutes long, but it can accurately output a summary. It would likely work for a video of about one hour as well.

I assume this will run on Google Colab. Please ensure the video file is placed in the root directory beforehand.

# From the root directory where the video file is located, extract only the I-frames and write them out as JPEG images.


!ffmpeg -i "/path_to_your_video.mp4" -vf "select='eq(pict_type\,PICT_TYPE_I)'" -vsync vfr -f image2 "/frame-%04d.jpg"

# Thin the extracted I-frames down to just under 250 and convert them into Base64-encoded image strings.


import cv2
import base64
import os

input_folder = '/content/'
output_base64 = []
# Sort the filenames so the frames stay in chronological order.
frames = sorted(f for f in os.listdir(input_folder) if f.endswith('.jpg'))
total_frames = len(frames)
target_frames = 248
if total_frames > target_frames:
    interval = round(total_frames / target_frames)
else:
    interval = 1
selected_frames = frames[::interval][:target_frames]
for frame in selected_frames:
    frame_path = os.path.join(input_folder, frame)
    img = cv2.imread(frame_path)
    _, buffer = cv2.imencode('.jpg', img)
    encoded_string = base64.b64encode(buffer).decode('utf-8')
    output_base64.append(encoded_string)

# Display the extracted I-frames, one every 1/3 second.

import time
import base64
from IPython.display import display, Image, clear_output

display_handle = display(None, display_id=True)

for img in output_base64:
    try:
        decoded_img = base64.b64decode(img.encode("utf-8"))
        display_handle.update(Image(data=decoded_img))
    except Exception as e:
        print(f"Error displaying image: {e}")
    time.sleep(1/3)
    clear_output(wait=True)

# Resize the images so that the total request payload does not exceed 250 MB.

from openai import OpenAI
from google.colab import userdata
client = OpenAI(api_key=userdata.get('OPENAI_API_KEY'))
PROMPT_MESSAGES = [
    {
        "role": "user",
        "content": [
            "These are frames from a video that I want to upload. Generate a compelling description with some emoji so I can upload it with the video.",
            *map(lambda x: {"image": x, "resize": 480}, output_base64),
        ],
    },
]
params = {
    "model": "gpt-4o",
    "messages": PROMPT_MESSAGES,
    "max_tokens": 500,
}

result = client.chat.completions.create(**params)
print(result.choices[0].message.content)

The output result is as follows:


:robot: Dive into the future with our latest video on AI and robotics! :rocket: Discover the power of Tesla’s cutting-edge robots, how neural networks function, and the role of AI in shaping industries. :globe_with_meridians: From trading strategies to health monitoring, see how artificial intelligence is revolutionizing our world. Don’t miss the neural network breakdown and the latest tech innovations driving the future! :star2:

#AI #Robotics #NeuralNetworks #Tesla #Technology #Innovation #FutureTech #ArtificialIntelligence


I think this is sufficient for summarizing a one-hour video. If summarizing an entire hour in one go proves too difficult, we could divide the video into first and second halves.
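A rough sketch of that chunked approach (the API calls are commented out and follow the same message shape as above; the variable names are illustrative):

```python
# Split the Base64 frames into chunks that each fit in one request,
# summarize each chunk separately, then merge the partial summaries.
def chunk_frames(frames, chunk_size=248):
    return [frames[i:i + chunk_size] for i in range(0, len(frames), chunk_size)]

# partial_summaries = []
# for chunk in chunk_frames(output_base64):
#     result = client.chat.completions.create(
#         model="gpt-4o",
#         max_tokens=300,
#         messages=[{"role": "user", "content": [
#             "Summarize this segment of a longer video.",
#             *map(lambda x: {"image": x, "resize": 480}, chunk),
#         ]}],
#     )
#     partial_summaries.append(result.choices[0].message.content)
# Finally, send "\n".join(partial_summaries) back to the model and
# ask for a single combined summary of the whole video.
```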

You can also use a transcript API to retrieve the transcript of any narrated video.
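For example, with the third-party `youtube-transcript-api` package (an assumption; the post above doesn’t name a specific library), the fetched caption segments can be joined into plain text and summarized alongside, or instead of, the frames:

```python
# Assumes the third-party package: pip install youtube-transcript-api
#   from youtube_transcript_api import YouTubeTranscriptApi
#   segments = YouTubeTranscriptApi.get_transcript("VIDEO_ID")

# Each segment is a dict like {"text": "...", "start": 12.4, "duration": 3.2}.
def transcript_to_text(segments):
    """Join timed caption segments into one plain-text transcript."""
    return " ".join(seg["text"] for seg in segments)

demo = [
    {"text": "Hello and welcome.", "start": 0.0, "duration": 2.0},
    {"text": "Today we look at robots.", "start": 2.0, "duration": 3.0},
]
print(transcript_to_text(demo))  # Hello and welcome. Today we look at robots.
```

The resulting text can then be sent to gpt-4o as an ordinary text prompt, which avoids the image limit entirely for videos whose content is mostly narration.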
