Resize parameter for gpt-4-vision-preview

The cookbook article "Processing and narrating a video with GPT’s visual capabilities and the TTS API" (I can’t link to it directly) lists a resize parameter as part of the request body, as seen in this snippet:

    import os
    import openai

    # base64Frames is the list of base64-encoded video frames built earlier in the cookbook
    PROMPT_MESSAGES = [
        {
            "role": "user",
            "content": [
                "These are frames from a video that I want to upload. Generate a compelling description that I can upload along with the video.",
                *map(lambda x: {"image": x, "resize": 768}, base64Frames[0::10]),
            ],
        },
    ]
    params = {
        "model": "gpt-4-vision-preview",
        "messages": PROMPT_MESSAGES,
        "api_key": os.environ["OPENAI_API_KEY"],
        "headers": {"Openai-Version": "2020-11-07"},
        "max_tokens": 200,
    }

    result = openai.ChatCompletion.create(**params)

The documentation for the vision API doesn’t list resize as a parameter, and it specifies this format instead:

    {
        "type": "image_url",
        "image_url": {
            "url": f"data:image/jpeg;base64,{base64_image}"
        }
    }

The code from the cookbook does work with openai==0.28, but I can’t tell whether the resize parameter actually does anything.

Images are processed either as a single 512x512 tile or, after the AI has understood the image at that resolution, as the original image broken into tiles of that size, up to a 2x4 tile grid.

The single-tile mode is selected with the parameter “detail”:“low”, which is apparently not the default.
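For reference, here is a minimal sketch of how the detail setting could be attached to an image in the documented image_url format. The helper name image_block is hypothetical, and the JPEG media type is an assumption about the input; only the "type", "image_url", "url" and "detail" keys come from the vision documentation.

```python
import base64


def image_block(image_bytes: bytes, detail: str = "low") -> dict:
    """Build one image content block in the documented image_url format.

    Hypothetical helper: assumes image_bytes holds JPEG data.
    detail is "low" (single tile), "high" (tiled), or "auto".
    """
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "type": "image_url",
        "image_url": {
            "url": f"data:image/jpeg;base64,{b64}",
            "detail": detail,
        },
    }


# Placeholder bytes stand in for a real JPEG file's contents.
block = image_block(b"abc", detail="low")
print(block["image_url"]["detail"])
```

The returned dictionary would go into the "content" list of a user message, alongside a {"type": "text", ...} entry.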

Let’s read about their own image mangling first:

Image inputs are metered and charged in tokens, just as text inputs are. The token cost of a given image is determined by two factors: its size, and the detail option on each image_url block. All images with detail: low cost 85 tokens each. detail: high images are first scaled to fit within a 2048 x 2048 square, maintaining their aspect ratio. Then, they are scaled such that the shortest side of the image is 768px long. Finally, we count how many 512px squares the image consists of. Each of those squares costs 170 tokens. Another 85 tokens are always added to the final total.

Here are some examples demonstrating the above.

  • A 1024 x 1024 square image in detail: high mode costs 765 tokens
    1024 is less than 2048, so there is no initial resize.
    The shortest side is 1024, so we scale the image down to 768 x 768.
    4 512px square tiles are needed to represent the image, so the final token cost is 170 * 4 + 85 = 765.
  • A 2048 x 4096 image in detail: high mode costs 1105 tokens
    We scale down the image to 1024 x 2048 to fit within the 2048 square.
    The shortest side is 1024, so we further scale down to 768 x 1536.
    6 512px tiles are needed, so the final token cost is 170 * 6 + 85 = 1105.
  • A 4096 x 8192 image in detail: low mode costs 85 tokens
    Regardless of input size, low detail images are a fixed cost.
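The quoted rules can be sketched as a small calculator. This is only my reading of the text above, not OpenAI's actual implementation: the function name is made up, and I am assuming both resize steps only ever shrink an image (the quote doesn't say what happens to images smaller than 768px).

```python
import math


def image_token_cost(width: int, height: int, detail: str = "high") -> int:
    """Estimate the prompt tokens for one image from the quoted rules."""
    if detail == "low":
        return 85  # fixed cost regardless of input size

    # Step 1: fit within a 2048 x 2048 square (assumed: shrink only).
    scale = min(1.0, 2048 / max(width, height))
    width, height = round(width * scale), round(height * scale)

    # Step 2: bring the shortest side to 768 px (assumed: shrink only).
    scale = min(1.0, 768 / min(width, height))
    width, height = round(width * scale), round(height * scale)

    # Step 3: 170 tokens per 512 px tile, plus 85 base tokens.
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 170 * tiles + 85


print(image_token_cost(1024, 1024))         # 765
print(image_token_cost(2048, 4096))         # 1105
print(image_token_cost(4096, 8192, "low"))  # 85
```

The three printed values match the three documented examples.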

So even though you send 1024x1024 and pay for a 2x2 tile expansion, they squash your image to 768x768 for some reason. Send a dall-e-3 image of 1792x1024 and the AI’s tiles get 1344x768 (a 3x2 grid) instead of 1536x878, which would fill the same 3x2 grid. Truly bizarre.
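To check the arithmetic for the 1792x1024 case, here is a quick sketch comparing the two scalings. Both function names are mine, and fill_tile_grid is just the alternative I would have expected, not anything the API offers.

```python
def shortest_side_768(width: int, height: int) -> tuple:
    """Scale so the shortest side is 768 px, per the documented rule."""
    scale = 768 / min(width, height)
    return round(width * scale), round(height * scale)


def fill_tile_grid(width: int, height: int, cols: int, rows: int, tile: int = 512) -> tuple:
    """Hypothetical alternative: the largest size that still fits the grid."""
    scale = min(cols * tile / width, rows * tile / height)
    return round(width * scale), round(height * scale)


print(shortest_side_768(1792, 1024))     # (1344, 768)
print(fill_tile_grid(1792, 1024, 3, 2))  # (1536, 878)
```

Both results occupy a 3x2 grid of 512px tiles, so the token cost is the same; the documented rule just delivers fewer pixels for it.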

So then what does this do?

            "role": "user",
            "content": [
                "How much of this image has wasted space border with no content?",
                {"image": "(imagedata)", "resize": 768},
                {"image": "(imagedata)", "resize": 512},
            ],


I suspect that if I were to send the 512 parameter, I would get the low detail image made of a single tile.

I don’t feel like spending money to answer questions, but there’s some code in another thread. I modified it there to print the input token count, and also added a function that resizes locally, so you can send the “detail” parameter or the “resize” entry in the dictionary and see what is rejected and what is billed as low-detail or high-detail prompt tokens.