You can get down to the video codec level: re-encode without a forced keyframe interval and extract just the I-frames (the frames the encoder promotes to keyframes, which usually coincide with scene changes). This might work better on an “install an AC unit” video than on music videos.
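As a sketch of that idea (assuming ffmpeg is installed; the file names are placeholders), ffmpeg’s `-skip_frame nokey` input option tells the decoder to skip everything except keyframes, so you can dump only the I-frames as images:

```python
def keyframe_extract_cmd(video_path: str, out_pattern: str) -> list[str]:
    """Build an ffmpeg command that decodes only keyframes (I-frames)
    and writes each one out as a numbered image file."""
    return [
        "ffmpeg",
        "-skip_frame", "nokey",   # decoder drops all non-keyframes
        "-i", video_path,
        "-vsync", "vfr",          # one output image per decoded frame
        out_pattern,
    ]

cmd = keyframe_extract_cmd("input.mp4", "iframe_%04d.png")
print(" ".join(cmd))
```

Pass the resulting list to `subprocess.run(cmd, check=True)` to actually extract the frames.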
There’s no getting lower than the 85 input tokens per low-detail image. But you can send multiple images per request, lowering the total prompt cost per image, and request more summarized results for similar scenes along with identification of scene changes.
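A minimal sketch of that batching, assuming the OpenAI chat-completions request format with the `detail: "low"` image setting (85 tokens per image); the model name and frame URLs here are placeholders:

```python
def build_batched_request(frame_urls: list[str], question: str) -> dict:
    """Pack several frames into one chat request so the text prompt is
    paid for once rather than once per frame. Each low-detail image
    costs a flat 85 input tokens."""
    content = [{"type": "text", "text": question}]
    for url in frame_urls:
        content.append({
            "type": "image_url",
            "image_url": {"url": url, "detail": "low"},  # low detail = 85 tokens/image
        })
    return {
        "model": "gpt-4-vision-preview",  # placeholder model name
        "messages": [{"role": "user", "content": content}],
    }

req = build_batched_request(
    [f"https://example.com/frame_{i}.jpg" for i in range(8)],
    "Summarize these frames and note any scene changes between them.",
)
print(len(req["messages"][0]["content"]))  # 1 text part + 8 images = 9
```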
Yes, I think the newest versions of Adobe’s Premiere Pro do have such a feature built in.
So it could be an option to look into modern video editing software. Most of these features are advertised as AI-powered, which could help you find the solutions faster.
The codec-based splitting would mean getting deep into video encoding with libraries, but it reuses the intelligence that advanced encoders like MPEG-4 already have. You can certainly also use other code-based techniques to pick the middle of each sequence between shot changes based on content.
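One way to sketch that (the 0.4 scene threshold and file paths are assumptions): ffmpeg’s `select` filter can flag frames whose scene-change score exceeds a threshold, and `showinfo` logs their `pts_time` timestamps on stderr. Given those cut timestamps, the midpoint of each shot is usually a representative frame to send to the vision model:

```python
def scene_change_cmd(video_path: str, threshold: float = 0.4) -> list[str]:
    """ffmpeg command that logs (via showinfo) the frames whose
    scene-change score exceeds the threshold; their timestamps appear
    as pts_time lines on stderr."""
    return [
        "ffmpeg", "-i", video_path,
        "-vf", f"select='gt(scene,{threshold})',showinfo",
        "-f", "null", "-",
    ]

def shot_midpoints(cut_times: list[float], duration: float) -> list[float]:
    """Given shot-change timestamps (seconds), return the middle of
    each shot between consecutive cuts."""
    bounds = [0.0] + sorted(cut_times) + [duration]
    return [(a + b) / 2 for a, b in zip(bounds, bounds[1:])]

print(shot_midpoints([10.0, 30.0], 60.0))  # [5.0, 20.0, 45.0]
```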
Looking for the appearance of a boss in a video game?
If you use the topic creator’s algorithm, you would be asking, for each of the 60 seconds × 60 minutes × 4 hours of frames, “is this a boss fight?”, and gambling that the vision AI has a clue, or has been trained to recognize that imagery from labeled data.
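A back-of-envelope estimate of that brute-force cost (the per-token price here is an assumption for illustration; check current pricing):

```python
def frame_scan_tokens(hours: float, fps: float = 1.0,
                      tokens_per_image: int = 85) -> int:
    """Image tokens for brute-force sampling: one low-detail frame per
    second across the whole video."""
    frames = int(hours * 3600 * fps)
    return frames * tokens_per_image

tokens = frame_scan_tokens(4)   # 4-hour video, 1 frame/s = 14,400 frames
print(tokens)                   # 14,400 * 85 = 1,224,000 image tokens
# At an assumed $0.01 per 1K input tokens, that's ~$12.24 per question,
# before counting the text prompt repeated in every request.
print(round(tokens / 1000 * 0.01, 2))
```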
Hopefully I can find a way to make this work.
Ideally I’d like to be able to take any random video and ask the program to find any random moment in it.
Edit: I currently have a program that does this (takes a random moment and a random video as input and finds the time at which it happens), but with GPT-4 Vision and extracting a frame from every second, it’s so expensive.