I chose the “bug” category, and not the “I’m confused” category. 
There isn’t exactly a limiter; there’s a downscaler, whose behavior one has to infer backwards from a single useful example, then confirm by running image-size trials against edge cases to see if you got it right.
Then you discover that whether 1024, 1536, or the example’s 1452 tokens should be billed, the API reports about 763. Worse still would be if that is what is actually happening: 763 vectorizations instead of full-quality vision.
How GPT-4.1 Calculates Image Tokens (Official Method)
GPT-4.1 calculates token counts for image inputs by dividing the image into a grid of small patches, each exactly 32 \times 32 pixels. The maximum allowed patch count (and thus tokens billed) is 1536.
If the initial number of patches exceeds this maximum, the image is scaled down proportionally (preserving aspect ratio), ensuring it fits within this limit.
Below is the precise calculation and algorithmic logic.
Step 1: Initial Patch Calculation (no scaling)
Given an original image resolution of W \times H pixels, calculate the initial number of patches along width and height:
\text{initialPatchW} = \left\lceil \frac{W}{32} \right\rceil
\text{initialPatchH} = \left\lceil \frac{H}{32} \right\rceil
The total initial patch count is thus:
\text{initialTotal} = \text{initialPatchW} \times \text{initialPatchH}
- If \text{initialTotal} \leq 1536, no resizing occurs. The token count is exactly \text{initialTotal}.
- Otherwise, proceed to Step 2 for scaling.
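Step 1 can be sketched in a few lines of Python; `initial_patches` is a hypothetical helper name, assuming the 32px patch size and 1536-patch cap stated above:

```python
import math

def initial_patches(w, h, patch=32, max_patches=1536):
    # Ceiling division: any partial patch at an edge counts as a full patch
    pw = math.ceil(w / patch)
    ph = math.ceil(h / patch)
    total = pw * ph
    # Second value: True if the image fits without any resizing
    return total, total <= max_patches
```

For a 1024 \times 1024 image this gives 32 \times 32 = 1024 patches, under the cap, so no resizing occurs.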
Step 2: Approximate Scaling (preserving aspect ratio)
If scaling is necessary, first apply an approximate scaling factor to bring the total patch count near the allowed maximum (1536):
The scaling factor is computed as follows:
\text{firstScale} = \sqrt{\frac{1536 \times 32^2}{W \times H}}
Applying this scale factor to the image dimensions gives approximate scaled dimensions:
\text{width1} = \lfloor W \times \text{firstScale} \rfloor
\text{height1} = \lfloor H \times \text{firstScale} \rfloor
Compute the intermediate (non-integer) patch dimensions after the first scaling:
\text{patchW1} = \frac{\text{width1}}{32}, \quad \text{patchH1} = \frac{\text{height1}}{32}
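A sketch of this first, approximate scaling pass, under the same assumptions (the name `first_scale` is mine):

```python
import math

def first_scale(w, h, patch=32, max_patches=1536):
    # Choose s so that (w*s/32) * (h*s/32) is approximately 1536,
    # preserving the aspect ratio
    s = math.sqrt(max_patches * patch**2 / (w * h))
    w1 = math.floor(w * s)
    h1 = math.floor(h * s)
    # Return the intermediate (non-integer) patch dimensions as well
    return w1, h1, w1 / patch, h1 / patch
```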
Step 3: Precise Patch Alignment (exact integer patches)
To ensure that patches align exactly to integer counts, perform a precise second scaling step:
- First, choose width as the reference dimension and set the final width patch count to the floor of the approximate value:
\text{finalPatchW} = \lfloor \text{patchW1} \rfloor
- Compute the exact adjustment scaling factor based on this width patch count:
\text{adjustmentScale} = \frac{\text{finalPatchW}}{\text{patchW1}}
- Apply this exact scaling factor uniformly to both dimensions:
\text{widthFinal} = \lfloor \text{width1} \times \text{adjustmentScale} \rfloor
\text{heightFinal} = \lfloor \text{height1} \times \text{adjustmentScale} \rfloor
- Now calculate the final height patch count as the ceiling of the new height divided by 32:
\text{finalPatchH} = \left\lceil \frac{\text{heightFinal}}{32} \right\rceil
- The total final tokens (patches) are:
\text{finalTokens} = \text{finalPatchW} \times \text{finalPatchH}
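The alignment step above, sketched in Python (again with hypothetical names, assuming 32px patches). Note that snapping the width to a whole patch count and applying the same factor to the height is what keeps the aspect ratio intact:

```python
import math

def align_patches(width1, height1, patch=32):
    # Snap width down to a whole number of patches…
    patch_w1 = width1 / patch
    final_pw = math.floor(patch_w1)
    adjust = final_pw / patch_w1
    # …then shrink both dimensions by the same exact factor
    w_final = math.floor(width1 * adjust)
    h_final = math.floor(height1 * adjust)
    final_ph = math.ceil(h_final / patch)
    return w_final, h_final, final_pw, final_ph
```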
Step 4: Final Check (always performed)
Although this case is extremely rare, check as a final step that the token count does not exceed the maximum:
- If \text{finalTokens} > 1536, adjust by removing one patch row from the height dimension:
\text{finalPatchH} = \text{finalPatchH} - 1
- Recalculate height dimension and tokens accordingly:
\text{heightFinal} = \text{finalPatchH} \times 32
\text{finalTokens} = \text{finalPatchW} \times \text{finalPatchH}
This guarantees compliance with the 1536 patch limit.
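This guard is a one-liner; a sketch (the helper name `enforce_cap` is mine):

```python
def enforce_cap(final_pw, final_ph, patch=32, max_patches=1536):
    # If rounding still pushed the count over the cap, drop one patch row
    if final_pw * final_ph > max_patches:
        final_ph -= 1
    # Returns (finalTokens, heightFinal)
    return final_pw * final_ph, final_ph * patch
```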
Concrete Example from OpenAI Documentation (1800 \times 2400 pixels):
Step 1: Initial patches
\left\lceil \frac{1800}{32} \right\rceil = 57
\left\lceil \frac{2400}{32} \right\rceil = 75
- Total patches: 57 \times 75 = 4275 > 1536, scaling needed.
Step 2: Approximate scaling
- Compute first scale factor:
\text{firstScale} = \sqrt{\frac{1536 \times 32^2}{1800 \times 2400}} \approx 0.603
\text{width1} = \lfloor 1800 \times 0.603 \rfloor = 1086
\text{height1} = \lfloor 2400 \times 0.603 \rfloor = 1448
\text{patchW1} = \frac{1086}{32} \approx 33.94, \quad \text{patchH1} = \frac{1448}{32} \approx 45.25
Step 3: Precise alignment
- Final integer patches (width reference):
\text{finalPatchW} = \lfloor 33.94 \rfloor = 33
\text{adjustmentScale} = \frac{33}{33.94} \approx 0.972
- Precisely adjusted dimensions:
\text{widthFinal} = \lfloor 1086 \times 0.972 \rfloor = 1056
\text{heightFinal} = \lfloor 1448 \times 0.972 \rfloor = 1408
\text{finalPatchH} = \left\lceil \frac{1408}{32} \right\rceil = 44
\text{finalTokens} = 33 \times 44 = 1452
Thus, the final dimensions are exactly 1056 \times 1408 pixels with 33 \times 44 = 1452 tokens.
Pseudocode Summary (for direct programming implementation)
FUNCTION CalculateTokens(W, H):
    initialPatchW = ceil(W / 32)
    initialPatchH = ceil(H / 32)
    initialTotal = initialPatchW × initialPatchH
    IF initialTotal ≤ 1536:
        RETURN (tokens=initialTotal, width=W, height=H)
    firstScale = sqrt((1536 × 32²) / (W × H))
    width1 = floor(W × firstScale)
    height1 = floor(H × firstScale)
    patchW1 = width1 / 32
    finalPatchW = floor(patchW1)
    adjustmentScale = finalPatchW / patchW1
    widthFinal = floor(width1 × adjustmentScale)
    heightFinal = floor(height1 × adjustmentScale)
    finalPatchH = ceil(heightFinal / 32)
    finalTokens = finalPatchW × finalPatchH
    IF finalTokens > 1536:
        finalPatchH = finalPatchH - 1
        heightFinal = finalPatchH × 32
        finalTokens = finalPatchW × finalPatchH
    RETURN (tokens=finalTokens, width=widthFinal, height=heightFinal,
            patchesW=finalPatchW, patchesH=finalPatchH)
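The pseudocode translates directly into Python; this is my sketch of the steps described above (not OpenAI's actual implementation), which reproduces the documentation's 1800 \times 2400 \to 1452 example:

```python
import math

def calculate_image_tokens(w, h, patch=32, max_patches=1536):
    # Step 1: ceiling-divide each dimension into 32px patches
    initial_pw = math.ceil(w / patch)
    initial_ph = math.ceil(h / patch)
    if initial_pw * initial_ph <= max_patches:
        return initial_pw * initial_ph, w, h
    # Step 2: approximate downscale toward the 1536-patch budget
    s = math.sqrt(max_patches * patch**2 / (w * h))
    w1 = math.floor(w * s)
    h1 = math.floor(h * s)
    # Step 3: snap width to a whole patch count, rescale height to match
    patch_w1 = w1 / patch
    final_pw = math.floor(patch_w1)
    adjust = final_pw / patch_w1
    w_final = math.floor(w1 * adjust)
    h_final = math.floor(h1 * adjust)
    final_ph = math.ceil(h_final / patch)
    # Step 4: safety net in case rounding still overshoots the cap
    if final_pw * final_ph > max_patches:
        final_ph -= 1
        h_final = final_ph * patch
    return final_pw * final_ph, w_final, h_final
```

Running this against the documented example, `calculate_image_tokens(1800, 2400)` yields 1452 tokens at 1056 \times 1408, matching Steps 1–4 above; a small image such as 512 \times 512 passes through unscaled at 256 tokens.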