Unexpected surge in cost: assistant + gpt-4-vision-preview

I am using the API to send an assistant (w/ gpt-4-vision-preview) a screenshot and the URL the screenshot is from. In the past 48 hours, I have noticed a huge spike in the cost of these queries.

We were spending $4-$5 a day, and suddenly we’re spending $20+. The number and size of the payloads do not appear to have changed appreciably in that time.

But looking at the billing details more closely, the number of context tokens appears to have ballooned 10x:


If I’m not sending larger payloads, where is the spike in context tokens coming from?

Are the instructions for the assistant being counted every time a new thread is opened? Even if that were the case, the instructions didn’t grow in size in the past 48 hours.

I may have figured it out – some of the screenshots in the past 48 hours have been much larger than the previous average.


I was coming in to suggest this, but you found your answer.

In general you should always optimize your input before passing it into expensive models.

Essentially, you should crop out or mask unimportant details and scale the image to some max dimension.
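As a concrete sketch of the scaling step: before touching pixels, compute the target dimensions so the longest side is capped at some maximum you choose (the 1280px cap here is just an illustrative value, not anything the API mandates).

```python
def fit_within(w: int, h: int, max_dim: int = 1280) -> tuple[int, int]:
    """Return (w, h) scaled so the longest side is at most max_dim,
    preserving aspect ratio. Never upscales a small image."""
    scale = min(1.0, max_dim / max(w, h))
    return (round(w * scale), round(h * scale))
```

You can then apply the computed size with whatever imaging library you already use, e.g. Pillow’s `Image.thumbnail((max_dim, max_dim))`, which does the same longest-side capping in place.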

Edit: I wanted to add that one thing you might experiment with is Segment Anything from Meta (see also: https://segment-anything.com/).

The broad strokes idea is that you would first segment your image then only send the relevant bits to the model for vision processing.

If your application is processing many similar screenshots, you can get even better results by fine-tuning SAM.
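To make the “send only the relevant bits” idea concrete: SAM gives you segmentation masks, and the part you’d hand-roll is turning a mask into a crop box. A minimal, library-free sketch of that step (the mask is assumed to be a row-major grid of booleans, like a thresholded SAM mask):

```python
def mask_bbox(mask: list[list[bool]]) -> tuple[int, int, int, int]:
    """Bounding box (x0, y0, x1, y1) of the True pixels in a boolean mask.
    x1/y1 are exclusive, so the box can be passed straight to a crop call."""
    ys = [y for y, row in enumerate(mask) if any(row)]
    xs = [x for row in mask for x, v in enumerate(row) if v]
    return (min(xs), min(ys), max(xs) + 1, max(ys) + 1)
```

With Pillow, `img.crop(mask_bbox(mask))` would then extract just that element before it ever reaches the vision model.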


Would you say that scaling the image down to a max height will hinder vision-preview’s ability to detect elements within the image (e.g. logos, text boxes)? I feel like I’m better off cropping the image to a max height than trying to scale it.

I just found { detail: 'low' } so I’ll try that first.
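For anyone else landing here, this is roughly what the `detail` knob looks like in the Chat Completions vision message format — `'low'` gets you a fixed, small token cost per image regardless of its size, while `'high'` tiles the image and costs proportionally more. (This sketch shows the chat-endpoint message shape; if you’re going through the Assistants API the plumbing differs, but the idea is the same.)

```python
def vision_message(image_url: str, prompt: str, detail: str = "low") -> dict:
    """Build a user message with an image attachment in the
    Chat Completions vision format, with the detail level set explicitly."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": image_url, "detail": detail}},
        ],
    }
```

The resulting dict goes into the `messages` list of a normal chat completion request.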

I think both cropping and scaling have their places.

For instance, if your image is too large, the API will scale it down invisibly in the background before sending it to the model:

For low res mode, we expect a 512px x 512px image. For high res mode, the short side of the image should be less than 768px and the long side should be less than 2,000px.

I don’t know which rescaling algorithm they are using in the background, but I can say it is almost always best to handle any preprocessing yourself so you can ensure the final inputs to the models are the best they can be.
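Doing that preprocessing yourself is just one scale-factor computation against the limits quoted above (short side ≤768px, long side ≤2000px), after which you can resample with an algorithm you trust, such as Lanczos:

```python
def fit_api_limits(w: int, h: int,
                   short_max: int = 768, long_max: int = 2000) -> tuple[int, int]:
    """Scale (w, h) down, preserving aspect ratio, so the short side is at
    most short_max and the long side is at most long_max (the high-res
    limits quoted in the docs). Never upscales."""
    scale = min(1.0, long_max / max(w, h), short_max / min(w, h))
    return (round(w * scale), round(h * scale))
```

Resize to the returned dimensions client-side (e.g. Pillow’s `img.resize(size, Image.LANCZOS)`) and the API’s own background rescaler should have nothing left to do.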

This is partly why I recommended Segment Anything above. With it you can do interesting things like cut out individual elements and resize them to fit in a 512x512 box. This helps ensure individual elements are kept together in the same high-resolution chunk.

But, yeah, doing some form of cropping and downscaling to ensure the input images align with some number of 512x512 blocks will help keep your costs in check while maintaining as much quality as possible.
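To see why the 512×512 alignment matters for cost, here is a rough token estimator for high-detail mode. The constants (85 base tokens per image plus 170 tokens per 512px tile) are from OpenAI’s published vision pricing at the time of writing — verify them against the current docs before relying on the numbers.

```python
import math

TILE = 512
BASE_TOKENS = 85        # per-image base cost (per published vision pricing)
TOKENS_PER_TILE = 170   # per 512x512 tile in high-detail mode

def high_detail_tokens(w: int, h: int) -> int:
    """Estimate the token cost of a high-detail image that is already
    within the API's size limits: the image is covered by 512x512 tiles,
    each costing TOKENS_PER_TILE, plus a fixed BASE_TOKENS overhead."""
    tiles = math.ceil(w / TILE) * math.ceil(h / TILE)
    return BASE_TOKENS + tiles * TOKENS_PER_TILE
```

A 768×768 screenshot lands on 4 tiles, while a 513×513 one also costs 4 tiles despite being barely larger than one — which is exactly why trimming images down to clean tile boundaries keeps costs in check.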


Thank you for the recommendations. We’re still fine-tuning this particular use case, but SAM sounds like it could be useful. We don’t know the contents of the screenshots ahead of time (that’s why we’re having ChatGPT analyze them for us) but we know the elements we’re expecting to find, so SAM might work for us there.


Can I ask more details about the screenshots you’re working with?

I would guess you’re analyzing some kind of UI elements, website layouts, or something of that nature, but if I knew more about what you’re broadly trying to get the model to do I might have some other, more targeted, suggestions.

We are looking at screenshots of web sites that have been reported for phishing. So we’re having ChatGPT analyze them for logos, username/password text boxes, and the like.

ChatGPT or through the API?

How are the sites being reported—are users submitting the screenshots themselves or are you going to the URL with a headless browser and capturing the screen as a rendered image?

Sorry, through the API.

Sometimes the screenshot is provided by the person reporting the URL, and sometimes it is grabbed via headless browser.


I’m sure you’ve considered this, but I’m wondering if you can explain the reasoning to me so I’m up to speed.

If you’ve got the ability to access the page directly, and some of the chief items you’re concerned with evaluating are logos, why are you not grabbing the page’s images directly and evaluating those?