I count the same 227 completion tokens by reassembling your assistant text and sending it to a tokenizer.
The user message is 7 tokens of overhead + 12 tokens of message text.
So if 765 prompt tokens are still unaccounted for, they must come from the image.
| Item | Value |
|---|---|
| Total tiles | 4 |
| Base tokens | 85 |
| Tile tokens | 170 × 4 = 680 |
| Total tokens | 765 |
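
The arithmetic in the table can be sketched in a few lines of Python. The constant and function names here are my own, chosen for illustration; only the 85 and 170 figures come from the table.

```python
BASE_TOKENS = 85       # flat per-image cost, from the table above
TOKENS_PER_TILE = 170  # cost of each 512x512 tile, from the table above


def image_tokens(tiles: int) -> int:
    """Return the prompt tokens for one image that is split into `tiles` tiles."""
    return BASE_TOKENS + TOKENS_PER_TILE * tiles


print(image_tokens(4))  # 85 + 170 * 4 = 765
```

With 4 tiles this reproduces the 765 tokens that were left unexplained in the prompt count.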
In detail:high mode, the image is resized internally based on its shortest side, which is why a 640x640 image, and anything up to 1024x1024, ends up taking four 512x512 “tiles”.
For a 640x640 image, detail:low brings the cost down to just 85 tokens, with the image downsized to 512x512 internally.
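
Here is a rough estimator for both modes, as a sketch rather than the actual server-side logic. It assumes the resize rules as I understand them from OpenAI's published description (fit within 2048x2048, then shrink so the shortest side is at most 768px, never upscaling); those two limits and the function name are assumptions, not something shown in this thread.

```python
import math


def estimate_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate prompt tokens for one image (sketch under assumed resize rules)."""
    if detail == "low":
        # detail:low is a flat cost regardless of the image size.
        return 85

    # Assumed step 1: shrink to fit inside a 2048x2048 square (no upscaling).
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale

    # Assumed step 2: shrink so the shortest side is at most 768px.
    scale = min(1.0, 768 / min(w, h))
    w, h = round(w * scale), round(h * scale)

    # Count the 512x512 tiles needed to cover the resized image.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles


print(estimate_image_tokens(640, 640, "high"))    # 4 tiles
print(estimate_image_tokens(1024, 1024, "high"))  # resized to 768x768, 4 tiles
print(estimate_image_tokens(640, 640, "low"))     # flat cost
```

Under these assumptions, both the 640x640 and 1024x1024 cases land on 4 tiles in high detail, matching the 765 tokens above, while detail:low stays at 85.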