Unexpected surge in cost: assistant + gpt-4-vision-preview

I am using the API to send an assistant (w/ gpt-4-vision-preview) a screenshot and the URL the screenshot is from. In the past 48 hours, I have noticed a huge spike in the cost of these queries.

We were spending $4-$5 a day, and suddenly we’re spending $20+. The number and size of the payloads do not appear to have changed appreciably in that time.

But looking at the billing details more closely, the number of context tokens appears to have ballooned 10x:

[billing dashboard screenshot showing the jump in context tokens]

If I’m not sending larger payloads, where is the spike in context tokens coming from?

Are the instructions for the assistant being counted every time a new thread is opened? Even if that were the case, the instructions didn’t grow in size in the past 48 hours.

I may have figured it out – some of the screenshots in the past 48 hours have been much larger than the previous average.


I was coming in to suggest exactly this, but it looks like you found your answer.

In general, you should always optimize your input before passing it into expensive models.

Essentially, you should crop out or mask unimportant details and scale the image to some max dimension.
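For example, here’s a minimal Pillow sketch of that kind of preprocessing (the 1024px cap, the commented-out crop region, and the file paths are placeholders to tune for your own screenshots):

```python
from PIL import Image

MAX_DIM = 1024  # hypothetical cap; tune for your screenshots

def shrink_screenshot(in_path: str, out_path: str) -> None:
    """Crop away uninteresting regions and downscale before sending to the model."""
    img = Image.open(in_path)
    # Optional: crop to the region you care about first, e.g. the top of the page.
    # img = img.crop((0, 0, img.width, min(img.height, 2000)))
    img.thumbnail((MAX_DIM, MAX_DIM))  # downscales in place, preserving aspect ratio
    img.save(out_path)
```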

Edit: I wanted to add that one thing you might experiment with is Segment Anything from Meta (see also: https://segment-anything.com/).

The broad strokes idea is that you would first segment your image then only send the relevant bits to the model for vision processing.

If your application is processing many similar screenshots, you can get even better results by fine-tuning SAM.
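As a rough illustration of the segment-then-crop idea, here is a sketch using the segment-anything package’s automatic mask generator. The checkpoint file, model size, and the “keep the five largest regions” heuristic are all assumptions to adapt, not a prescription:

```python
import cv2
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

# Checkpoint file and model size are placeholders; the weights come from the
# facebookresearch/segment-anything repo.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b_01ec64.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

image = cv2.cvtColor(cv2.imread("screenshot.png"), cv2.COLOR_BGR2RGB)
masks = mask_generator.generate(image)  # list of dicts: "segmentation", "bbox", "area", ...

# Heuristic: keep the five largest regions and crop them out for the vision model.
for i, m in enumerate(sorted(masks, key=lambda m: m["area"], reverse=True)[:5]):
    x, y, w, h = (int(v) for v in m["bbox"])  # bbox is in XYWH format
    crop = image[y:y + h, x:x + w]
    cv2.imwrite(f"crop_{i}.png", cv2.cvtColor(crop, cv2.COLOR_RGB2BGR))
```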


Would you say that scaling the image down to a max height will hinder vision-preview’s ability to detect elements within the image (e.g. logos, text boxes)? I feel like I’m better off cropping the image to a max height than trying to scale it.

I just found { detail: 'low' } so I’ll try that first.
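For reference, this is what passing detail through the Chat Completions form of gpt-4-vision-preview looks like (our Assistants setup may differ; the prompt text and image URL here are placeholders):

```python
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "List any logos and login form fields in this screenshot."},
            {
                "type": "image_url",
                # detail="low" caps the image at a single low-resolution pass
                "image_url": {"url": "https://example.com/screenshot.png", "detail": "low"},
            },
        ],
    }],
    max_tokens=300,
)
print(response.choices[0].message.content)
```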

I think both cropping and scaling have their places.

For instance, if your image is too large, the API will scale it down invisibly in the background before sending it to the model. From the docs:

> For low res mode, we expect a 512px x 512px image. For high res mode, the short side of the image should be less than 768px and the long side should be less than 2,000px.

I don’t know which rescaling algorithm they are using in the background, but I can say it is almost always best to handle any preprocessing yourself so you can ensure the final inputs to the model are the best they can be.

This is partly why I recommended Segment Anything above. With it you can do interesting things like cut out individual elements and resize them to fit in a 512x512 box. This helps ensure individual elements are kept together in the same high-resolution chunk.

But, yeah, doing some form of cropping and downscaling to ensure the input images align with some number of 512x512 blocks will help keep your costs in check while maintaining as much quality as possible.
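If it helps for budgeting, here is a rough token estimator based on the tiling rules OpenAI documents for high-detail images (fit within 2048x2048, scale the short side to 768px, then 170 tokens per 512x512 tile plus a flat 85). Treat the numbers as approximations:

```python
import math

def estimate_high_detail_tokens(width: int, height: int) -> int:
    """Approximate image token count for detail='high' using the documented tiling rules."""
    # Fit within a 2048 x 2048 square.
    scale = min(1.0, 2048 / max(width, height))
    width, height = width * scale, height * scale
    # Scale so the shortest side is at most 768px.
    scale = min(1.0, 768 / min(width, height))
    width, height = width * scale, height * scale
    # Count 512px tiles: 170 tokens each, plus a flat 85.
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 85 + 170 * tiles

print(estimate_high_detail_tokens(2560, 1440))  # 1105 tokens
print(estimate_high_detail_tokens(1024, 576))   # 765 tokens
# detail='low' is a flat 85 tokens regardless of image size.
```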


Thank you for the recommendations. We’re still fine-tuning this particular use case, but SAM sounds like it could be useful. We don’t know the contents of the screenshots ahead of time (that’s why we’re having ChatGPT analyze them for us), but we know the elements we’re expecting to find, so SAM might work for us there.


Can I ask more details about the screenshots you’re working with?

I would guess you’re analyzing some kind of UI elements, website layouts, or something of that nature, but if I knew more about what you’re broadly trying to get the model to do I might have some other, more targeted, suggestions.

We are looking at screenshots of websites that have been reported for phishing. So we’re having ChatGPT analyze them for logos, username/password text boxes, and so on.

ChatGPT or through the API?

How are the sites being reported—are users submitting the screenshots themselves or are you going to the URL with a headless browser and capturing the screen as a rendered image?

Sorry, through the API.

Sometimes the screenshot is provided by the person reporting the URL, and sometimes it is grabbed via headless browser.
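For anyone curious, the headless-browser capture can be as simple as this Playwright sketch; the viewport size and wait strategy here are placeholders rather than exact settings:

```python
from playwright.sync_api import sync_playwright

def capture(url: str, out_path: str = "screenshot.png") -> None:
    """Render a reported URL in headless Chromium and save a viewport screenshot."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page(viewport={"width": 1280, "height": 1024})
        page.goto(url, wait_until="networkidle")
        # full_page=False keeps the capture to the viewport, which also keeps token costs down
        page.screenshot(path=out_path, full_page=False)
        browser.close()
```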

Question…

I’m sure you’ve considered this, but I’m wondering if you can explain the reasoning to me so I’m up to speed.

If you’ve got the ability to access the page directly, and some of the chief items you’re concerned with evaluating are logos, why not grab the page images directly and evaluate those?
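As a quick sketch of what I mean, something like this pulls the page’s image URLs with requests and BeautifulSoup (it only catches <img> tags; CSS backgrounds and inline SVGs would need extra handling):

```python
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def page_image_urls(url: str) -> list[str]:
    """Collect <img> URLs from a page so logos can be fetched and checked directly."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    # Resolve relative paths against the page URL.
    return [urljoin(url, img["src"]) for img in soup.find_all("img") if img.get("src")]
```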