I am currently using GPT-4 Vision to “analyze” videos. I split a video into frames (one frame per second) and feed them to GPT so that it can “watch” and “analyze” my video.
This method is expensive as I am sending 100,000s of images per day. How can I reduce cost? The image detail is already set to ‘low’.
Is there something cheaper I can use here? A different image analysis tool? Maybe even a video analysis tool?
Once we have the video frames, we craft our prompt and send a request to GPT. (Note that we don’t need to send every frame for GPT to understand what’s going on.)
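A rough sketch of that flow (untested; assumes OpenCV and the openai Python package, and the model name and prompt text are placeholders you'd swap for your own):

```python
import base64
import cv2
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def sample_frames(path: str, every_n_seconds: int = 1) -> list[str]:
    """Grab one base64-encoded JPEG every N seconds of video."""
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30
    frames, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % int(fps * every_n_seconds) == 0:
            ok, buf = cv2.imencode(".jpg", frame)
            if ok:
                frames.append(base64.b64encode(buf.tobytes()).decode())
        idx += 1
    cap.release()
    return frames


def describe(frames: list[str]) -> str:
    """Send a batch of frames in a single request, with detail kept at 'low'."""
    content = [{"type": "text",
                "text": "These are frames from a video, in order. Describe what happens."}]
    content += [
        {"type": "image_url",
         "image_url": {"url": f"data:image/jpeg;base64,{b64}", "detail": "low"}}
        for b64 in frames
    ]
    resp = client.chat.completions.create(
        model="gpt-4-vision-preview",  # placeholder; use whichever vision model you have access to
        messages=[{"role": "user", "content": content}],
        max_tokens=300,
    )
    return resp.choices[0].message.content
```

From there, cost mostly comes down to how many frames you send per video and per request.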
Have you experimented with skipping a certain number of frames? Sending a frame every other second instead of every second would immediately halve your costs.
Some more context on your use-case might be helpful to determine what options you have.
You can get down to the video codec level: re-encode without a forced keyframe interval and extract just the I-frames the encoder inserts at scene changes. This might work better on an “install an AC unit” video than on music videos.
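Something along these lines with ffmpeg's select filter (assumes ffmpeg is on the path; the 0.4 scene-change threshold is just a starting point, and swapping the filter for select='eq(pict_type,I)' would give you the raw I-frames instead):

```python
import subprocess

def extract_scene_changes(video: str, out_dir: str, threshold: float = 0.4) -> None:
    """Dump one JPEG per detected scene change using ffmpeg's scene-detection filter."""
    subprocess.run(
        [
            "ffmpeg", "-i", video,
            "-vf", f"select='gt(scene,{threshold})'",
            "-vsync", "vfr",                      # keep only the selected frames
            f"{out_dir}/scene_%04d.jpg",
        ],
        check=True,
    )

extract_scene_changes("ac_unit_install.mp4", "frames")  # hypothetical paths
```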
There’s no getting lower than the 85 tokens per low-detail image input. You can send multiple images per request, lowering the per-image prompt overhead, and request more summarized results for similar scenes along with identification of scene changes.
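Back-of-the-envelope, that flat 85 tokens per low-detail image is what drives the bill (the per-token price below is a placeholder; check the current rate for whatever model you're on):

```python
frames_per_day = 100_000           # order of magnitude from the original post
tokens_per_low_detail_image = 85   # flat cost of an image at detail='low'
price_per_1k_input_tokens = 0.01   # placeholder rate, not current pricing

image_tokens = frames_per_day * tokens_per_low_detail_image          # 8,500,000 tokens/day
daily_image_cost = image_tokens / 1000 * price_per_1k_input_tokens   # ~$85/day at the placeholder rate
print(image_tokens, daily_image_cost)
```

So halving the frame count, or only sending frames around scene changes, translates directly into halved (or better) image-token spend.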
Yes, I think the newest versions of Adobe’s Premiere Pro do have such a feature built in.
So it could be an option to look into modern video editing software. Most of these features are advertised as AI-powered, which could help you find the right solution faster.
The codec-based splitting means getting deep into video encoding libraries, but it employs the intelligence that advanced encoders like MPEG-4 already have. You can certainly use other code-based techniques to find the middle of the sequences between shot changes based on their content.
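If you'd rather stay out of the codec and do it in code, a histogram difference between consecutive frames is usually enough to spot cuts, and then you keep the frame in the middle of each shot. A rough OpenCV sketch (the 0.5 threshold is an arbitrary starting point):

```python
import cv2

def middle_frames_of_shots(path: str, threshold: float = 0.5) -> list[int]:
    """Return frame indices roughly in the middle of each shot, where a shot
    boundary is a large histogram change between consecutive frames."""
    cap = cv2.VideoCapture(path)
    cuts, prev_hist, idx = [0], None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
        hist = cv2.calcHist([hsv], [0, 1], None, [32, 32], [0, 180, 0, 256])
        cv2.normalize(hist, hist)
        if prev_hist is not None:
            diff = cv2.compareHist(prev_hist, hist, cv2.HISTCMP_BHATTACHARYYA)
            if diff > threshold:   # big jump in colour distribution, treat as a cut
                cuts.append(idx)
        prev_hist = hist
        idx += 1
    cap.release()
    cuts.append(idx)
    # middle frame of each [cut, next_cut) interval
    return [(a + b) // 2 for a, b in zip(cuts, cuts[1:]) if b > a]
```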
Looking for the appearance of a boss in a video game?
If you use the topic creator’s algorithm, it would be asking, for each of 60 seconds × 60 minutes × 4 hours of frames, “is this a fight with the boss level?”, and gambling that the vision AI has a clue or has been trained to recognize that imagery from labeled data.
Yeah, you can see how it gets expensive, fast. With GPT-4 Vision it has at least been accurate.
Yes, that’s how it works.
You can also train your own model, like YOLO, in case the scene looks a bit different every time, and use it in conjunction with OpenCV.
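For example, with the ultralytics package it could look roughly like this ("boss.pt" is a stand-in for whatever custom weights you'd train on your own labeled screenshots):

```python
import cv2
from ultralytics import YOLO

model = YOLO("boss.pt")  # hypothetical custom weights trained on labeled frames


def find_first_hit(video: str, every_n_frames: int = 30, conf: float = 0.6) -> float | None:
    """Return the timestamp (seconds) of the first sampled frame where the detector fires."""
    cap = cv2.VideoCapture(video)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % every_n_frames == 0:
            results = model(frame, verbose=False)
            if any(len(r.boxes) and float(r.boxes.conf.max()) >= conf for r in results):
                cap.release()
                return idx / fps
        idx += 1
    cap.release()
    return None
```

Running the detector locally makes the per-frame cost essentially zero, so you'd only pay GPT-4 Vision (if at all) for the handful of candidate frames it flags.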
Hopefully I can find a way to make this work.
Ideally I’d like to be able to take any random video and ask the program to find any random moment in it.
Edit: I currently have a program that does this (it takes a random moment and a random video as input and finds the time at which the moment happens), but with GPT-4 Vision and a frame from every second, it’s so expensive.
Audio transcription may be much more in line with the more robust techniques of language AI processing: about 10 cents for 16 minutes of audio-to-text.
You could split the audio on silence, send the chunks labeled with their source-time metadata, and see when the AI says “there it is!” about the resulting transcript pieces (which can be employed further).
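A sketch of that idea with pydub and the Whisper endpoint (the thresholds and the keyword match at the end are stand-ins for however you'd actually label the chunks and spot the moment):

```python
from pydub import AudioSegment
from pydub.silence import detect_nonsilent
from openai import OpenAI

client = OpenAI()


def find_moment_by_audio(audio_path: str, keyword: str) -> list[tuple[float, str]]:
    """Transcribe silence-delimited chunks and return (start_seconds, transcript)
    for every chunk whose transcript mentions the keyword."""
    audio = AudioSegment.from_file(audio_path)
    # [start_ms, end_ms] spans of non-silent audio; thresholds are arbitrary starting points
    spans = detect_nonsilent(audio, min_silence_len=700, silence_thresh=-40)
    hits = []
    for start_ms, end_ms in spans:
        audio[start_ms:end_ms].export("chunk.mp3", format="mp3")
        with open("chunk.mp3", "rb") as f:
            text = client.audio.transcriptions.create(model="whisper-1", file=f).text
        if keyword.lower() in text.lower():
            hits.append((start_ms / 1000, text))
    return hits
```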
Would this work? Unfortunately, when parsing through something like gameplay, I don’t think there would be audio to help identify what’s going on on-screen.