I’ve been using the OpenAI API to process invoice summaries with images. I have a function that takes images and a prompt to generate a summary. Here’s the function I’m using:
I tried changing the image sizes from the original 1153x1536 to 3756x5000, but the token usage still remains the same.
This doesn’t seem right, based on the documentation which mentions:
high will enable “high res” mode, which first allows the model to first see the low res image (using 85 tokens) and then creates detailed crops using 170 tokens for each 512px x 512px tile.
Am I doing something wrong here, or is this a known issue/bug? Any insights or advice would be greatly appreciated!
I just did a deep dive into what you can expect for token usage (and rate usage) for a variety of resolutions, detail settings, and models.
If two images at detail:high need the same number of tiles, they cost the same. This means anything from 513x513 to 1024x1024, or anything in between, results in 4 overlay tiles (on top of a base “low” image).
There are also peculiarities in the internal downsizing even on detail:high. Your image will be downsized so the shortest dimension is at most 768 pixels. Send 3000x3000 and the model sees 768x768, which is 4 tiles of 512x512. Send 2000x500 and the model sees 2000x500, which is also 4 tiles of 512x512.
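To make the tile counting concrete, here is a rough estimator. It is a sketch, not an official formula: the 85 and 170 token figures come from the documentation quoted earlier in this thread, the "fit inside 2048x2048 first" step is my assumption from the vision guide rather than something stated above, and the function name `estimate_image_tokens` is made up for illustration. Other models may use different multipliers.

```python
import math

BASE_TOKENS = 85    # low-res overview, per the documentation quoted in this thread
TILE_TOKENS = 170   # per 512x512 tile, per the same quote


def estimate_image_tokens(width, height, detail="high"):
    """Return (seen_width, seen_height, tiles, tokens) for one image (estimate only)."""
    if detail == "low":
        return width, height, 0, BASE_TOKENS

    # Assumed first step: shrink to fit inside a 2048x2048 square.
    longest = max(width, height)
    if longest > 2048:
        scale = 2048 / longest
        width, height = round(width * scale), round(height * scale)

    # Then shrink so the shortest side is at most 768 px (never upscale).
    shortest = min(width, height)
    if shortest > 768:
        scale = 768 / shortest
        width, height = round(width * scale), round(height * scale)

    # Count the 512x512 tiles needed to cover what is left.
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return width, height, tiles, BASE_TOKENS + TILE_TOKENS * tiles


if __name__ == "__main__":
    for size in [(3000, 3000), (2000, 500), (1153, 1536), (3756, 5000)]:
        print(size, estimate_image_tokens(*size))
```

Under these assumptions all four sizes, including both sizes from the original question, come out at 4 tiles, which would explain why the token usage did not move.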
Just to clarify, since my images always have the same ratio (A4 format), does that mean it doesn’t make any difference if I increase the resolution beyond 768px on the smaller side? From what I understand, as long as the smaller side exceeds 768px, the model will downsize it to 768px, and the token usage will remain the same regardless of any further increase in resolution, correct?
A large A4 paper image (actually any A paper size in tall aspect ratio) would always resize to 1087x768. That then would consume six token tiles of high detail (a 3x2 grid), as the longest dimension exceeds two tiles, i.e. 1024 pixels.
You can consider then the economy of sending 1024x725 as the image, where the expense would drop to four high quality tiles.
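If you want to do that downsizing yourself before uploading, the target size can be computed like this. It is only a sketch: `fit_in_four_tiles` is a hypothetical helper name, and the 1024 and 768 limits are taken from the tile and short-side sizes discussed above. Rounding can land a pixel away from the figure I gave.

```python
def fit_in_four_tiles(width, height, max_long=1024, max_short=768):
    """Scale (width, height) down, keeping aspect ratio, to fit a 2x2 grid of 512px tiles.

    Images that already fit are returned unchanged (no upscaling).
    """
    long_side = max(width, height)
    short_side = min(width, height)
    # Use the tighter of the two limits, and never scale up.
    scale = min(1.0, max_long / long_side, max_short / short_side)
    return round(width * scale), round(height * scale)


if __name__ == "__main__":
    # 2480x3508 is an A4 page scanned at 300 dpi, used here as an example input.
    print(fit_in_four_tiles(2480, 3508))
    # The size from the original question.
    print(fit_in_four_tiles(1153, 1536))
```

The result is the size to pass to whatever image library you use for resizing; the resize itself is left out here.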
Or consider the quality increase if you were to use this strategy:
The page is sized to 1024x1448 by your code.
You take a view of the top at 1024x768, and a view of the bottom at 1024x768
88 pixels of overlap between the two images give some commonality for the vision to join.
Those two images are placed into the same user message.
You pay for two four-tile images instead of one six-tile image
The AI would have higher resolution text and more tokens of encoded image in general to contain information.
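The crop boxes for that two-view split can be worked out as below. Again just a sketch: `split_into_views` is a name I made up, and the boxes use the (left, upper, right, lower) order that Pillow's `Image.crop` accepts, so they can be passed straight to it if you use Pillow.

```python
def split_into_views(page_width, page_height, view_height=768):
    """Return ([top_box, bottom_box], overlap_px) for a tall page.

    Boxes are (left, upper, right, lower). The top view starts at the top edge
    and the bottom view ends at the bottom edge, so any extra height becomes
    overlap in the middle.
    """
    if page_height > 2 * view_height:
        raise ValueError("two views of this height cannot cover the page")
    if page_height <= view_height:
        raise ValueError("page already fits in a single view")

    top_box = (0, 0, page_width, view_height)
    bottom_box = (0, page_height - view_height, page_width, page_height)
    overlap = 2 * view_height - page_height
    return [top_box, bottom_box], overlap


if __name__ == "__main__":
    boxes, overlap = split_into_views(1024, 1448)
    print(boxes, overlap)
```

For the 1024x1448 page this gives a bottom view starting at row 680 and the 88 pixels of overlap mentioned above. With the 85 + 170 per tile figures from the documentation quote, two four-tile views would come to 2 x 765 = 1530 tokens against 1105 for the single six-tile image.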