Video analysis with OpenAI

I am trying to analyze a video that I made with my drone.
The video has no voice; it only records activities.
What I am trying to get is a timestamped summary of what happened in the video, for example:
08:01 Person leaves house and walks to garden shed
08:05 Person takes equipment out of the garden shed and walks into garden area
08:10 Person does some work in the garden
Is it feasible to get this information?

Here’s a cookbook example. It is overly optimistic about how many images current vision models can accept, and it uses a method that has been retired in the newest models, but it gives an overview of extracting frames and asking the model about them.

You would need to discard many more frames than the example does to stay within a budget. Timestamping is also not a feature: the model only knows when a frame occurred if you tell it, based on what you know about the segment of video you sent for analysis.
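To make the budget and timestamp point concrete, here is a minimal sketch of how you might decide which frames to keep and what time label to attach to each one. It uses only the Python standard library. The function name `plan_frames`, the frame budget of 20, and the 08:00 recording start are all assumptions for illustration; actual frame extraction (for example with OpenCV) and the model call are left out.

```python
from datetime import datetime, timedelta


def plan_frames(total_frames, fps, max_frames, start):
    """Pick evenly spaced frame indices that fit a frame budget.

    Returns a list of (frame_index, "HH:MM:SS") pairs. The time label is
    computed from the frame's position in the video plus the recording
    start time, because the model cannot read time from the image itself.
    """
    if total_frames <= 0 or fps <= 0 or max_frames <= 0:
        return []
    count = min(max_frames, total_frames)
    step = total_frames / count  # spacing between kept frames
    plan = []
    for i in range(count):
        index = int(i * step)
        moment = start + timedelta(seconds=index / fps)
        plan.append((index, moment.strftime("%H:%M:%S")))
    return plan


# Hypothetical values: a 10-minute clip at 30 fps that started at 08:00,
# reduced to a budget of 20 frames (one frame every 30 seconds).
start = datetime(2024, 1, 1, 8, 0, 0)
plan = plan_frames(total_frames=18000, fps=30, max_frames=20, start=start)
for index, label in plan[:3]:
    print(index, label)
```

Each label would then go into the prompt next to its image, so that the model's answer can refer to a time at all. Even spacing is the simplest choice; picking frames where the scene changes would likely spend the budget better, but needs more code.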

A second round of AI processing may be required to remove redundancy in what is reported for each image. The model has no actual long-term view of the video, only the selection of images you choose to provide.
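As a rough illustration of the redundancy problem, the sketch below collapses runs of repeated per-frame descriptions into one event each. Note that this is a plain string comparison, not the second AI pass mentioned above: it only catches near-verbatim repeats and could at most serve as a cheap pre-filter before asking a model to merge differently worded descriptions. The function name `collapse_repeats` and the sample observations are made up for the example.

```python
def collapse_repeats(observations):
    """Merge runs of consecutive identical descriptions.

    `observations` is a list of (timestamp, description) pairs in time
    order. Descriptions are compared ignoring case and extra whitespace,
    and the first timestamp of each run is kept.
    """
    events = []
    previous_key = None
    for timestamp, description in observations:
        key = " ".join(description.lower().split())
        if key != previous_key:
            events.append((timestamp, description))
            previous_key = key
    return events


# Hypothetical per-frame answers, one per analyzed image.
observations = [
    ("08:01", "Person walks to garden shed"),
    ("08:02", "Person walks to garden shed"),
    ("08:03", "person walks to  garden shed"),
    ("08:05", "Person takes equipment out of the shed"),
    ("08:10", "Person works in the garden"),
    ("08:11", "Person works in the garden"),
]
events = collapse_repeats(observations)
for timestamp, description in events:
    print(timestamp, description)
```

In practice the model rarely words the same scene identically twice, which is why a second model call over the whole list of timestamped descriptions is the more realistic way to get a clean summary.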


Many thanks for the answer. Seems that I am asking for something that is currently not easily achievable. Let’s wait a couple of months and then see what is possible.