Image Analysis Options: GPT-4 Too Expensive

I am currently using GPT 4-Vision to “analyze” videos. I split a video into frames (one frame per second) and feed them to GPT so that it can “watch” and “analyze” my video.

This method is expensive, as I am sending hundreds of thousands of images per day. How can I reduce the cost? The image detail is already set to ‘low’.

Is there something cheaper I can use here? A different image analysis tool? Maybe even a video analysis tool?

Processing and narrating a video with GPT’s visual capabilities and the TTS API | OpenAI Cookbook

Once we have the video frames, we craft our prompt and send a request to GPT (Note that we don’t need to send every frame for GPT to understand what’s going on):

Have you experimented with skipping a certain number of frames? Sending a frame every other second instead of every second would immediately halve your costs.

Some more context on your use-case might be helpful to determine what options you have.
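To make the frame-skipping suggestion concrete, here is a minimal sketch that lists which seconds to sample. The helper name `sample_timestamps` and its `step_s` parameter are made up for illustration; they are not part of any library.

```python
def sample_timestamps(duration_s: int, step_s: int = 1) -> list[int]:
    """Return the second offsets at which to grab one frame."""
    if step_s < 1:
        raise ValueError("step_s must be at least 1")
    return list(range(0, duration_s, step_s))


# A 4-hour video sampled every second versus every other second.
every_second = sample_timestamps(4 * 60 * 60, step_s=1)
every_other = sample_timestamps(4 * 60 * 60, step_s=2)
print(len(every_second), len(every_other))  # 14400 7200
```

Whether a larger step still catches the moment depends on how long that moment stays on screen, so the step is something to tune per use case.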


The goal is to look for a specific moment in a video and return what time it happens.

I do this by sending one frame from each second of the video to GPT, asking it if this frame represents the moment we’re looking for.

I put all the relevant frames into a list, and split it into sequences of relevant frames (ex. [1, 2, 3, 8, 9, 10] → [[1, 2, 3], [8, 9, 10]])

I then ask it which sequence best describes the moment I am looking for.
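The grouping step described above can be sketched roughly like this. The function name `group_consecutive` is hypothetical, and the sketch assumes the frame numbers are whole seconds.

```python
def group_consecutive(frames: list[int]) -> list[list[int]]:
    """Split frame indices into runs of consecutive numbers."""
    groups: list[list[int]] = []
    for f in sorted(set(frames)):
        if groups and f == groups[-1][-1] + 1:
            # Continues the current run.
            groups[-1].append(f)
        else:
            # Gap found, so start a new run.
            groups.append([f])
    return groups


print(group_consecutive([1, 2, 3, 8, 9, 10]))  # [[1, 2, 3], [8, 9, 10]]
```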

You can get down to the video codec level: re-encode without a forced keyframe interval, then extract only the I-frames, which the encoder tends to place at scene changes. This might work better on an “install an AC unit” video than on music videos.
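One way to try the keyframe idea is with ffmpeg's `select` filter. The sketch below only builds the command; the file names are placeholders, and it assumes ffmpeg is installed. Newer ffmpeg releases spell the `-vsync vfr` option as `-fps_mode vfr`. Keep in mind that encoders also insert keyframes at regular intervals, so not every extracted frame is a scene change.

```python
import subprocess


def keyframe_extract_cmd(src: str, out_pattern: str) -> list[str]:
    """Build an ffmpeg command that writes only I-frames (keyframes) as images."""
    return [
        "ffmpeg",
        "-i", src,
        # Keep only frames whose picture type is I.
        "-vf", "select='eq(pict_type,I)'",
        # Do not duplicate frames to fill the gaps between selected ones.
        "-vsync", "vfr",
        out_pattern,
    ]


cmd = keyframe_extract_cmd("input.mp4", "keyframe_%04d.png")
print(cmd)
# To actually run it (needs ffmpeg on PATH and a real input file):
# subprocess.run(cmd, check=True)
```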

There’s no getting lower than the 85 tokens per image input. You can, however, send multiple images per request, which lowers the prompt overhead per image, and request more summarized results for similar scenes plus identification of scene changes.
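A quick back-of-the-envelope sketch of why batching helps: the text prompt is paid once per request, so its cost is spread over every image in that request. The 85-token figure is the one quoted above; the 100-token prompt size is an assumption, and output tokens are ignored.

```python
TOKENS_PER_LOW_DETAIL_IMAGE = 85  # figure quoted in the thread


def input_tokens_per_image(prompt_tokens: int, images_per_request: int) -> float:
    """Average input tokens paid per image when the text prompt is shared."""
    total = prompt_tokens + images_per_request * TOKENS_PER_LOW_DETAIL_IMAGE
    return total / images_per_request


# Assumed 100-token text prompt; the real value depends on your instructions.
for n in (1, 5, 10):
    print(n, input_tokens_per_image(100, n))
# 1 185.0
# 5 105.0
# 10 95.0
```

The savings flatten out quickly, since the per-image 85 tokens dominate once the prompt is shared across a handful of images.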


Interesting. So I can split the video by frames that represent a scene change?


Yes, I think the newest versions of Adobe’s Premiere Pro have such a feature built in.
So it could be an option to look into modern video editing software. Most of these features are advertised as AI-powered, which could help you find a solution faster.


Codec-based splitting means getting deep into video encoding with libraries, but it reuses the scene-detection intelligence that encoders for codecs like MPEG-4 already have. You can certainly use other code-based techniques to detect shot changes from the frame contents and pick the middle of each sequence between them.
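As a rough sketch of the content-based alternative: treat a large jump in average pixel difference between neighbouring frames as a shot change, then keep only the middle frame of each shot. Frames here are toy flattened grayscale lists, and the threshold of 30 is an arbitrary assumption; real footage would need decoded images and a tuned threshold.

```python
def mean_abs_diff(a: list[int], b: list[int]) -> float:
    """Average absolute pixel difference between two equal-sized grayscale frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def shot_midpoints(frames: list[list[int]], threshold: float = 30.0) -> list[int]:
    """Return the index of the middle frame of every detected shot."""
    if not frames:
        return []
    # A new shot starts wherever a frame differs sharply from the previous one.
    starts = [0] + [
        i for i in range(1, len(frames))
        if mean_abs_diff(frames[i - 1], frames[i]) > threshold
    ]
    ends = starts[1:] + [len(frames)]
    return [(s + e - 1) // 2 for s, e in zip(starts, ends)]


# Toy data: three dark frames, then four bright frames (2x2 pixels, flattened).
dark, bright = [10, 12, 11, 9], [200, 210, 205, 198]
frames = [dark] * 3 + [bright] * 4
print(shot_midpoints(frames))  # [1, 4]
```

Sending one frame per shot instead of one per second is where the savings would come from, at the risk of missing moments that happen within a single long shot.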


I wonder how this would work if I were looking for a moment in a video game, e.g. a certain boss fight in a 4-hour-long gameplay video.

Looking for the appearance of a boss in a video game?

If you use the algorithm of the topic creator, you would be asking “is this a fight with the boss level” for each of 60 seconds x 60 minutes x 4 hours = 14,400 frames, and gambling that the vision AI has a clue or has been trained to recognize that imagery from labeled data.


If you know what the scene will look like, you can try OpenCV.

It should do the trick for you and there are a lot of resources to get you going.
It’s also a cheap option.


algorithm of the topic creator

What does this mean?

60 seconds x 60 minutes x 4 hours of frames “is this a fight with the boss level” and gambling the vision AI would have a clue or has been trained on recognizing that imagery from labeled data.

Yeah, you can see how it gets expensive, fast. At least it has been accurate with GPT 4-Vision.

Oh, so your idea is to upload an image of what the moment looks like beforehand, and then use OpenCV to match a frame with the uploaded image?


Yes, that’s how it works.
You can also train your own model, such as YOLO, in case the scene looks a bit different every time, and use it in conjunction with OpenCV.
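To illustrate the reference-image idea without any dependencies, here is a toy sketch that scores each frame by its mean absolute difference from a reference and keeps the close ones. It is a stand-in only: with real images you would more likely reach for something like OpenCV's `cv2.matchTemplate`. The function names and the `max_distance` of 15 are assumptions.

```python
def frame_distance(frame: list[int], reference: list[int]) -> float:
    """Mean absolute difference between two same-sized grayscale images."""
    if len(frame) != len(reference):
        raise ValueError("frame and reference must have the same size")
    return sum(abs(p - q) for p, q in zip(frame, reference)) / len(frame)


def find_matches(frames: list[list[int]], reference: list[int],
                 max_distance: float = 15.0) -> list[int]:
    """Return frame indices (seconds, at one frame per second) close to the reference."""
    return [
        i for i, f in enumerate(frames)
        if frame_distance(f, reference) <= max_distance
    ]


# Toy 2x2 grayscale frames, flattened; frames 1 and 2 resemble the reference.
reference = [250, 20, 250, 20]
frames = [
    [10, 10, 10, 10],
    [245, 25, 240, 30],
    [255, 18, 248, 22],
    [90, 90, 90, 90],
]
print(find_matches(frames, reference))  # [1, 2]
```

Exact pixel comparison like this breaks as soon as the scene shifts or scales, which is where a trained detector would come in.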

Hopefully I can find a way to make this work.
Ideally I’d like to be able to take any random video and ask the program to find any random moment in it.
Edit: I currently have a program that does this (it takes a random moment and a random video as input and finds what time the moment happens), but with GPT 4-Vision and a frame from every second, it’s so expensive.

Each new post in this forum is referred to as a “topic”, with followup “replies”.

At the beginning, the topic creator (the “OP”) describes sending one image frame per second of video; that is their “algorithm”.

You gave a proper analysis of the expense of doing so.

Audio transcription may be much more in line with the more robust techniques of language AI processing. It costs about 10 cents for 16 minutes of audio-to-text.

You could split by silence, send the chunks that you’ve labeled by source time metadata, and see when the AI says “there it is!” to the resulting transcript pieces (which can be further employed).
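A rough sketch of the split-by-silence step, assuming you already have one loudness value per second of audio (how you measure it is up to you). The function name and the 0.05 silence threshold are assumptions. Each chunk keeps its source start and end time so a transcript hit can be mapped back to the video.

```python
def speech_chunks(loudness: list[float],
                  silence_below: float = 0.05) -> list[tuple[int, int]]:
    """Return (start_s, end_s) pairs for runs of non-silent seconds; end is exclusive."""
    chunks: list[tuple[int, int]] = []
    start = None
    for t, level in enumerate(loudness):
        if level >= silence_below and start is None:
            start = t  # sound begins
        elif level < silence_below and start is not None:
            chunks.append((start, t))  # sound ended at second t
            start = None
    if start is not None:
        # Audio was still non-silent at the end of the track.
        chunks.append((start, len(loudness)))
    return chunks


loudness = [0.0, 0.3, 0.4, 0.01, 0.0, 0.2, 0.25, 0.3]
print(speech_chunks(loudness))  # [(1, 3), (5, 8)]
```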

Would this work? Unfortunately when parsing through things such as gameplay, I don’t think there would be audio to help identify what’s going on on-screen.

If you rely on voice transcription, and there is no spoken word, I would have to conclude that it would not work…
