Resize parameter for gpt-4-vision-preview

The " Processing and narrating a video with GPT’s visual capabilities and the TTS API" (I can’t link to it directly) lists a resize parameter as part of the request body as seen in this snippet:

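# Note: base64Frames is built earlier in the cookbook; the video is read with
# OpenCV and each frame is JPEG-encoded to base64 before reaching this snippet.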
PROMPT_MESSAGES = [
    {
        "role": "user",
        "content": [
            "These are frames from a video that I want to upload. Generate a compelling description that I can upload along with the video.",
            *map(lambda x: {"image": x, "resize": 768}, base64Frames[0::10]),
        ],
    },
]
params = {
    "model": "gpt-4-vision-preview",
    "messages": PROMPT_MESSAGES,
    "api_key": os.environ["OPENAI_API_KEY"],
    "headers": {"Openai-Version": "2020-11-07"},
    "max_tokens": 200,
}

result = openai.ChatCompletion.create(**params)
print(result.choices[0].message.content)

The documentation for the vision API doesn’t list resize as a parameter, and it also specifies this format instead:

{
  "type": "image_url",
  "image_url": {
    "url": f"data:image/jpeg;base64,{base64_image}"
  }
}

The code from the cookbook does work with openai==0.28, but I can’t tell if the resize parameter actually does anything.

Images are either processed as a single 512x512 tile, or, after the AI gets an overview at that resolution, the original image is broken into tiles of that size, up to a 2x4 tile grid.

There’s a parameter “detail”: “low”, apparently not the default, that sets the single-tile mode.

Let’s read about their own image mangling first:

Image inputs are metered and charged in tokens, just as text inputs are. The token cost of a given image is determined by two factors: its size, and the detail option on each image_url block. All images with detail: low cost 85 tokens each. detail: high images are first scaled to fit within a 2048 x 2048 square, maintaining their aspect ratio. Then, they are scaled such that the shortest side of the image is 768px long. Finally, we count how many 512px squares the image consists of. Each of those squares costs 170 tokens. Another 85 tokens are always added to the final total.

Here are some examples demonstrating the above.

  • A 1024 x 1024 square image in detail: high mode costs 765 tokens
    1024 is less than 2048, so there is no initial resize.
    The shortest side is 1024, so we scale the image down to 768 x 768.
    4 512px square tiles are needed to represent the image, so the final token cost is 170 * 4 + 85 = 765.
  • A 2048 x 4096 image in detail: high mode costs 1105 tokens
    We scale down the image to 1024 x 2048 to fit within the 2048 square.
    The shortest side is 1024, so we further scale down to 768 x 1536.
    6 512px tiles are needed, so the final token cost is 170 * 6 + 85 = 1105.
  • A 4096 x 8192 image in detail: low mode costs 85 tokens
    Regardless of input size, low detail images are a fixed cost.
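To make that arithmetic easy to check, here is a minimal sketch of the tile accounting in Python. It only follows the quoted passage; the function name, the ceil rounding, and the assumption that smaller images are also scaled to 768px are mine, not anything official.

import math

def vision_token_cost(width: int, height: int, detail: str = "high") -> int:
    """Estimate prompt tokens for one image, per the quoted docs (my reading, not an official formula)."""
    if detail == "low":
        return 85  # flat cost regardless of input size
    # 1. Scale to fit within a 2048 x 2048 square, keeping aspect ratio
    scale = min(1.0, 2048 / max(width, height))
    width, height = width * scale, height * scale
    # 2. Scale so the shortest side is 768px (assumption: applied even if that would upscale)
    scale = 768 / min(width, height)
    width, height = width * scale, height * scale
    # 3. Count 512px tiles at 170 tokens each, plus a flat 85 tokens
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 170 * tiles + 85

print(vision_token_cost(1024, 1024))         # 765, matching the first example
print(vision_token_cost(2048, 4096))         # 1105, matching the second example
print(vision_token_cost(4096, 8192, "low"))  # 85, matching the third example
print(vision_token_cost(1792, 1024))         # 1105: scaled to 1344 x 768, a 3x2 tile grid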

So even though you are paying for a 2x2 tile expansion when you send 1024x1024, they squash your 1024x1024 down to 768x768 for some reason. Send a dall-e-3 image of 1792x1024 and the AI’s tiles get 1344x768 (3x2) instead of, say, 1536x878 (also 3x2). Truly bizarre.

So then what does this do?

{
    "role": "user",
    "content": [
        "How much of this image has wasted space border with no content?",
        {"image": "(imagedata)", "resize": 768},
        {"image": "(imagedata)", "resize": 512},
    ],
},

I suspect that if I were to send "resize": 512, I would get the low-detail image made of a single tile.

I don’t feel like spending money to answer questions; however, there’s some code in another thread. I modified it there to print the input token count, and also provided a different function that resizes locally, so you can send the “detail” parameter, send the “resize” entry in the dictionary, and see what is rejected and what costs you low- and high-detail prompt tokens.
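For reference, here is a minimal sketch of the kind of local-resize helper I mean, assuming Pillow is installed; shrink_to_base64 is my own name, not anything from the API. You can feed its output either to the documented image_url format or to the cookbook-style {"image": ..., "resize": ...} entry and compare the reported prompt_tokens.

import base64
from io import BytesIO

from PIL import Image  # pip install pillow

def shrink_to_base64(image_path: str, longest_side: int) -> str:
    """Shrink an image so its longest side is at most longest_side; return it as a base64 JPEG string."""
    img = Image.open(image_path)
    img.thumbnail((longest_side, longest_side))  # in place; keeps aspect ratio; only ever shrinks
    buf = BytesIO()
    img.convert("RGB").save(buf, format="JPEG")
    return base64.b64encode(buf.getvalue()).decode("utf-8")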


I just want to echo that I’m confused about this as well.

Inputting user images doesn’t seem to be documented or mentioned at all in the API Reference for the chat endpoint. I’m still not exactly clear on what the resize parameter does, since it’s only shown in passing in this cookbook: https://cookbook.openai.com/examples/gpt_with_vision_for_video_understanding (link broken because I’m not allowed to embed links??)

Would appreciate it if OpenAI would provide a more detailed API reference for using gpt-4-vision. It’s quite a pain needing to dig through a quickstart guide and an example cookbook just to uncover all the various parameters supported.

For documentation, you just need to go to the API Reference, expand “chat”, then messages, then user message, and keep expanding.

The method currently documented in the YAML and the Azure Swagger spec is:

[
  {
    "role": "user",
    "content": [
      {
        "type": "text",
        "text": "What’s in this image?"
      },
      {
        "type": "image_url",
        "image_url": {
          "url": f"data:image/jpeg;base64,{base64_image}",
          "detail": "low"
        }
      }
    ]
  }
]
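For completeness, a minimal sketch of that documented format as a full request via requests, so you can compare it side by side with the undocumented cookbook format further down; the image path and prompt text are just placeholders:

import base64
import os
import requests

with open("./ocr.png", "rb") as f:
    base64_image = base64.b64encode(f.read()).decode("utf-8")

payload = {
    "model": "gpt-4-vision-preview",
    "max_tokens": 200,
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/jpeg;base64,{base64_image}",
                        "detail": "low",  # or "high" for tiled processing
                    },
                },
            ],
        }
    ],
}

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY')}",
}

response = requests.post("https://api.openai.com/v1/chat/completions", headers=headers, json=payload)
print(response.json())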

This is the essence of the cookbook code:

Simpler parallel calls/multiple images

[
  {
    "role": "user",
    "content": [
      "Describe images",
      {
        "image": "<base64_image1>",
        "resize": 768
      },
      {
        "image": "<base64_image2>",
        "resize": 768
      },
      {
        "image": "<base64_image3>",
        "resize": 768
      }
    ]
  }
]
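If you want to build that content list from local files, here is a minimal sketch (the file paths and helper name are placeholders of mine):

import base64

def b64(path: str) -> str:
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

frame_paths = ["frame1.jpg", "frame2.jpg", "frame3.jpg"]  # placeholder file names

messages = [
    {
        "role": "user",
        "content": [
            "Describe images",
            *({"image": b64(p), "resize": 768} for p in frame_paths),
        ],
    }
]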

I’ve done several experiments with different image sizes and different parameters, and in all cases, the reported usage only shows the 85 tokens of a single base image.

The resize parameter replaced with “rezize” gets a 400 invalid-request error, so the key is at least validated.

The method is able to do OCR that is impossible if I manually resize to 512.

OCR output for a 100-token prompt

{‘id’: ‘chatcmpl-888’, ‘object’: ‘chat.completion’, ‘created’: 1702174553, ‘model’: ‘gpt-4-1106-vision-preview’, ‘usage’: {‘prompt_tokens’: 100, ‘completion_tokens’: 500, ‘total_tokens’: 600}, ‘choices’: [{‘message’: {‘role’: ‘assistant’, ‘content’: ‘The image contains text which appears to be from an academic or technical document. The visible text is as follows:\n\n—\n\nWe also undertake a systematic study of “data contamination” – a growing problem when training high capacity models on datasets such as Common Crawl, which contain potentially huge coverage of potential test datasets and obscure such dataset-stripping effects. Across these experiments we observe few discernible patterns in GPT-3's performance gains, including on down-stream benchmarks. Across all benchmarks and tasks we measure a relatively smooth power-law relationship between task performance and model size, although with a few interesting departures from this trend.\n\nIn our second set of experiments we provide the first ever broad-based benchmark of sparse expert models, and we provide detailed comparisons with the capabilities and limitations of dense models across 40 different quantitative and qualitative measures. We show evidence that in several situations a hybrid dense-sparse mixture model achieves the best results.\n\nIn the third set of experiments we pit GPT-3 against various benchmarks that aim to measure a models ability to reason, use common sense, or use background knowledge. Across these tests we find that scaling up model size consistently improves performance; however, gains diminish with scale across all tasks, behaving similarly to previous findings with smaller scale models.\n\nIn the final set of experiments we investigate whether GPT-3's large scale provides new capabilities, or just provides more of the same capabilities displayed by GPT-2, we discuss effects on generalization, reasoning, and various forms of knowledge in detail. We also conduct several case studies within specialized domains such as SAT analogies, Engligh-as-a-second-language reading comprehension, and trivia. We further explore GPT-3's limitations around its “world model”: its representations of common sense, factualness, and bias.\n\n2 Our Approach\n\nOur basic pre-training approach, including models, data, and training, is similar to the process described in [Rad19], which heavily relies on scaling including the data and training time and naturally extends length of training. Our core training involves a large amount of data, most notably the entirety of the Common Crawl – a dataset that for the purposes of this work we estimate to be on the order of 45TB in size [RCF+19], but we also train on a plethora of different settings that we believe are crucial for GPT-3's efficacy, each building on work we systematically explore different settings for significant model training characteristics like model size especially scaling up accordingly as the raw’}, ‘finish_details’: {‘type’: ‘max_tokens’}, ‘index’: 0}]}


Example Python, without using the openai library
import base64
import requests
import os

# Path to your image
image_path = "./ocr.png"

# Function to encode the image
def encode_image(image_path):
  with open(image_path, "rb") as image_file:
    return base64.b64encode(image_file.read()).decode('utf-8')

# Getting the base64 string
base64_image = encode_image(image_path)

headers = {
  "Content-Type": "application/json",
  "Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY')}"
}

payload = {
  "model": "gpt-4-vision-preview",
  "max_tokens": 500,
  "messages": [
    {
      "role": "user",
      "content": [
        "Provide a full transcription of the image text",
        {
          "image": base64_image,
          "resize": 768,
        }
      ]
    }
  ]
}


response = requests.post("https://api.openai.com/v1/chat/completions", headers=headers, json=payload)

if response.status_code != 200:
    print(f"HTTP error {response.status_code}: {response.text}")
else:
    print(response.json())
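If you want to see exactly what you are billed, as in the experiments above, you can print just the usage block instead of the whole body (the key names are taken from the response dump above):

# Assuming the response object from the example above:
data = response.json()
print(data["usage"])  # e.g. {'prompt_tokens': 100, 'completion_tokens': 500, 'total_tokens': 600}
print(data["choices"][0]["message"]["content"])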

(I also tested the March developer livestream method, only to be hit with 91k input tokens and “It appears to be a string of text rather than an actual image”, making the experiment cost about $1.)

Have you tried setting the detail to high on the resized 512x512 tile? I was getting the same issue until I tried with high and then every image was being handled at 99%+ accuracy.

No, I just sent the small image via the same method where the 1024px version was recognized, to see if any resize was being done with “resize”: 512. The 512px image was one I resized myself, to see if the AI was magic. The method does not have the “high” option. I just confirmed again that the parameter still has no impact even at 112.

With the resized image sent via the documented method and “high”:

Pure hallucination on the GPT-3 paper

We demonstrate a systematic study of “Plasmonic-assisted” approaches when focusing light very much below the diffraction limit (<100nm FWHM), which can potentially improve current fiber nano-tip technology and enable new devices for ultra-high resolution optical lithography and data storage. Our plasmonic approach uses metallic nano-structures with a dielectric coating to confine and enhance light to a sub-100 nm focal spot at the dielectric-metal interface. The metal's high reflectivity serves as a back mirror and the dielectric as a spacer to maintain a distance from the metal surface, which is advantageous for reducing loss.

In addition to the above, we are studying materials and methods suitable for our applications by understanding their capabilities and limitations under ultra-fast optical pulses, vastly used in those applications. The infrared (IR) regime offers us a broad

See if you can OCR the 512px version with your own eyeballs… (with a reference of where the 512px grid might fall in a 1024px image)