Image Analysis Options: GPT-4 Too Expensive

matthewethan · January 10, 2024, 2:44pm

I am currently using GPT 4-Vision to “analyze” videos. I split a video into frames (one frame per second), and feed it to GPT so that it can “watch” and “analyze” my video.

This method is expensive as I am sending 100,000s of images per day. How can I reduce cost? The image detail is already set to ‘low’.

Is there something cheaper I can use here? A different image analysis tool? Maybe even a video analysis tool?

trenton.dambrowitz · January 10, 2024, 2:48pm

Processing and narrating a video with GPT’s visual capabilities and the TTS API | OpenAI Cookbook

Once we have the video frames, we craft our prompt and send a request to GPT (Note that we don’t need to send every frame for GPT to understand what’s going on):

Have you experimented with skipping a certain amount of frames? Sending a frame every other second instead of every second would immediately half your costs.

Some more context on your use-case might be helpful to determine what options you have.

matthewethan · January 10, 2024, 2:53pm

The goal is to look for a specific moment in a video and return what time it happens.

I do this by sending one frame from each second of the video to GPT, asking it if this frame represents the moment we’re looking for.

I put all the relevant frames into a list, and split it into sequences of relevant frames (ex. [1, 2, 3, 8, 9, 10] → [[1, 2, 3], [8, 9, 10]])

I then ask it which sequence best describes the moment I am looking for.

_j · January 10, 2024, 3:02pm

You can get down to the video codec level: re-encode without a forced frame type rate and just extract i-frames or b-frame that indicate a scene change. Might work better on “install an AC unit” video instead of music videos.

There’s not getting lower than the 85 tokens per image input. You can send multiple images per request, lowering the total prompt per image, and requesting more summarized results of similar scenes and identification of scene changes.

matthewethan · January 10, 2024, 3:06pm

Interesting. So I can split the video by frames that represent a scene change?

vb · January 10, 2024, 3:11pm

Yes, I think the newest editions of Adobe’s Premiere Pro do have such a feature build-in.
So it could be an option to look into modern video editing software. Most of these features are advertised as AI powered which could help you find the solutions faster.

_j · January 10, 2024, 3:41pm

The codec-based splitting would be getting deep into video encoding using libraries, but employs the intelligence that advanced encoders like MPEG4 already have. You can certainly use other code-based techniques to find the middle of sequences between shot changes based on contents.

matthewethan · January 10, 2024, 5:50pm

I wonder how this was worked if I was looking for a moment in a video game. Ex. a certain boss fight in a 4 hour long gameplay

_j · January 10, 2024, 6:00pm

Looking for the appearance of a boss in a video game?

If you use the algorithm of the topic creator, it would be asking for each of 60 seconds x 60 minutes x 4 hours of frames “is this a fight with the boss level” and gambling the vision AI would have a clue or has been trained on recognizing that imagery from labeled data.

vb · January 10, 2024, 6:03pm

If you know how the scene will look like you can try OpenCV.

It should do the trick for you and there are a lot of resources to get you going.
It’s also a cheap option.

matthewethan · January 10, 2024, 6:05pm

algorithm of the topic creator

What does this mean?

60 seconds x 60 minutes x 4 hours of frames “is this a fight with the boss level” and gambling the vision AI would have a clue or has been trained on recognizing that imagery from labeled data.

Yeah, you can see how it gets expensive, fast. With GPT 4-Vision it has been accurate at least

matthewethan · January 10, 2024, 6:06pm

Oh, so your idea is to upload an image of what the moment looks like beforehand, and then use OpenCV to match a frame with the uploaded image?

vb · January 10, 2024, 6:08pm

Yes, that’s how it works.
You can also train your own model like YOLO in case the scene looks a bit different every time and use it in conjunction with OpenCV.

matthewethan · January 10, 2024, 6:10pm

Hopefully I can find a way to make this work.
Ideally I’d like to be able to take any random video and ask the program to find any random moment in it.
Edit: I currently have a program that does this (takes random moment and random video as input and finds what time it happens) but with GPT 4-Vision and getting a frame from every second, it’s so expensive

_j · January 10, 2024, 6:33pm

Each new post in this forum is referred to as a “topic”, with followup “replies”.

The topic creator (the “OP”), at the beginning is describing sending one image frame per video second as their “algorithm”.

You gave proper analysis of the expense of doing so.

_j · January 10, 2024, 6:38pm

Audio transcription may be much more in line with the more robust techniques of language AI processing. 10 cents for 16 minutes of audio->text.

You could split by silence, send the chunks that you’ve labeled by source time metadata, and see when the AI says “there it is!” to the resulting transcript pieces (which can be further employed).

matthewethan · January 11, 2024, 7:54pm

Would this work? Unfortunately when parsing through things such as gameplay, I don’t think there would be audio to help identify what’s going on on-screen.

_j · January 11, 2024, 11:40pm

If you rely on voice transcription, and there is no spoken word, I would have to conclude that it would not work…

Topic		Replies	Views
Reading videos with GPT4V API gpt-4	39	28014	December 1, 2024
TTS API service usability API tts	17	6186	December 16, 2023
New Realtime API voices and cache pricing Announcements realtime , prompt-caching	26	3076	November 27, 2024
I don't understand the pricing for the realtime API API realtime	33	6807	October 8, 2024

Image Analysis Options: GPT-4 Too Expensive

Related topics