Understanding GPT-Vision API pricing?

I count the same 227 completion tokens when I reassemble your assistant text and send it to a tokenizer.

The user message accounts for 7 tokens of overhead plus 12 tokens of message text.

That leaves 765 prompt tokens still to account for, so they must come from the image input.

Total tiles: 4
Base tokens: 85
Tile tokens: 170 × 4 = 680
Total tokens: 85 + 680 = 765
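
If it helps, the same arithmetic can be written as a tiny helper. This is just a sketch of the numbers above (85 base tokens plus 170 per tile); the function and constant names are my own, not anything from the API.

```python
BASE_TOKENS = 85        # flat cost per image, from the breakdown above
TOKENS_PER_TILE = 170   # cost of each 512x512 tile, from the breakdown above


def image_tokens(tiles: int) -> int:
    """Return the prompt tokens billed for one image made of `tiles` tiles."""
    return BASE_TOKENS + TOKENS_PER_TILE * tiles


print(image_tokens(4))  # 765
```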

In detail:high mode, the image is resized internally based on its shortest side and then divided into 512x512 “tiles”. That is why a 640x640 image, and anything up to 1024x1024, takes 4 tiles.

With detail:low, a 640x640 image costs only 85 tokens, because it is downsized to 512x512 internally.
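
Here is a rough estimator for the resizing behavior described above. Treat it as a sketch under assumptions: I am assuming the image is first fit within 2048x2048, then shrunk so its shortest side is at most 768 pixels, and never upscaled. Those limits and the function name are my reading of the behavior, not something confirmed in this thread.

```python
import math


def estimate_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate prompt tokens for one image (hypothetical helper)."""
    if detail == "low":
        # Low detail: flat base cost, no tiles.
        return 85

    # Assumption: fit within 2048x2048, keeping the aspect ratio.
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale

    # Assumption: shrink so the shortest side is at most 768 px (no upscaling).
    scale = min(1.0, 768 / min(w, h))
    w, h = round(w * scale), round(h * scale)

    # Count the 512x512 tiles needed to cover the resized image.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles


print(estimate_image_tokens(640, 640, "high"))    # 765
print(estimate_image_tokens(1024, 1024, "high"))  # 765
print(estimate_image_tokens(640, 640, "low"))     # 85
```

Under these assumptions, only images that end up 512x512 or smaller fit in a single tile in high detail, which would come to 255 tokens.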