Resize parameter for gpt-4-vision-preview

For documentation, you just need to go to the API Reference, expand the chat “messages” parameter down to the user message, and keep expanding.

The method currently documented in the YAML spec and the Azure Swagger is:

[
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "What’s in this image?"
        },
        {
          "type": "image_url",
          "image_url": {
            "url": f"data:image/jpeg;base64,{base64_image}",
            "detail": "low"
            
          }
        }
      ]
    }
  ]
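
For comparison with what follows, here is a minimal runnable version of that documented format using requests against the chat completions endpoint; the image path and max_tokens value are placeholders:

import base64
import os
import requests

# Encode a local image for the data URL (path is a placeholder)
with open("./example.jpg", "rb") as f:
    b64_image = base64.b64encode(f.read()).decode("utf-8")

payload = {
  "model": "gpt-4-vision-preview",
  "max_tokens": 300,
  "messages": [
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "What's in this image?"},
        {
          "type": "image_url",
          "image_url": {
            "url": f"data:image/jpeg;base64,{b64_image}",
            "detail": "low"  # documented values: low, high, auto
          }
        }
      ]
    }
  ]
}

headers = {
  "Content-Type": "application/json",
  "Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY')}"
}
print(requests.post("https://api.openai.com/v1/chat/completions",
                    headers=headers, json=payload).json().get("usage"))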

This is the essence of the cookbook code:

Simpler parallel calls/multiple images

[
  {
    "role": "user",
    "content": [
      "Describe images",
      {
        "image": base64_image1,
        "resize": 768
      },
      {
        "image": base64_image2,
        "resize": 768
      },
      {
        "image": base64_image3,
        "resize": 768
      }
    ]
  }
]
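
For the multi-image case I wrap that format in a small helper and fan the requests out over a thread pool. The build_content and vision_request helpers are my own names, not anything from the API, and base64_image1..3 are assumed to be already-encoded strings:

import concurrent.futures
import os
import requests

API_URL = "https://api.openai.com/v1/chat/completions"
HEADERS = {
  "Content-Type": "application/json",
  "Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY')}"
}

def build_content(prompt, b64_images, resize=768):
  # Undocumented content format: a bare prompt string plus image/resize dicts
  return [prompt] + [{"image": img, "resize": resize} for img in b64_images]

def vision_request(prompt, b64_images, resize=768, max_tokens=300):
  payload = {
    "model": "gpt-4-vision-preview",
    "max_tokens": max_tokens,
    "messages": [{"role": "user", "content": build_content(prompt, b64_images, resize)}]
  }
  return requests.post(API_URL, headers=HEADERS, json=payload).json()

# One request per image, sent in parallel
with concurrent.futures.ThreadPoolExecutor() as pool:
  futures = [pool.submit(vision_request, "Describe images", [img])
             for img in (base64_image1, base64_image2, base64_image3)]
  for f in futures:
    print(f.result().get("usage"))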

I’ve run several experiments with different image sizes and parameter values, and in every case the usage reports only the 85 prompt tokens of a single base-size image.

Replacing resize with a misspelling such as "rezize" returns a 400 invalid-parameter error, so the parameter name is validated.
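
That is easy to reproduce with a quick probe (reusing the headers and base64_image from the full script further down):

for key in ("resize", "rezize"):
  probe = {
    "model": "gpt-4-vision-preview",
    "max_tokens": 1,
    "messages": [{"role": "user", "content": ["test", {"image": base64_image, key: 768}]}]
  }
  r = requests.post("https://api.openai.com/v1/chat/completions", headers=headers, json=probe)
  print(key, r.status_code)  # "resize" is accepted; "rezize" comes back 400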

The method can perform OCR that is impossible if I first manually resize the image to 512 px.
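
For that comparison I downscale locally so the longest side is 512 px and send the result without resize; this sketch assumes Pillow is installed:

import base64
import io
from PIL import Image

def encode_resized(image_path, max_side=512):
  # Shrink so the longest side is max_side, then re-encode to base64
  img = Image.open(image_path)
  img.thumbnail((max_side, max_side))
  buf = io.BytesIO()
  img.save(buf, format="PNG")
  return base64.b64encode(buf.getvalue()).decode("utf-8")

base64_small = encode_resized("./ocr.png", 512)  # OCR fails for me on this version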

OCR output for a 100-token prompt:

{‘id’: ‘chatcmpl-888’, ‘object’: ‘chat.completion’, ‘created’: 1702174553, ‘model’: ‘gpt-4-1106-vision-preview’, ‘usage’: {‘prompt_tokens’: 100, ‘completion_tokens’: 500, ‘total_tokens’: 600}, ‘choices’: [{‘message’: {‘role’: ‘assistant’, ‘content’: ‘The image contains text which appears to be from an academic or technical document. The visible text is as follows:\n\n—\n\nWe also undertake a systematic study of “data contamination” – a growing problem when training high capacity models on datasets such as Common Crawl, which contain potentially huge coverage of potential test datasets and obscure such dataset-stripping effects. Across these experiments we observe few discernible patterns in GPT-3's performance gains, including on down-stream benchmarks. Across all benchmarks and tasks we measure a relatively smooth power-law relationship between task performance and model size, although with a few interesting departures from this trend.\n\nIn our second set of experiments we provide the first ever broad-based benchmark of sparse expert models, and we provide detailed comparisons with the capabilities and limitations of dense models across 40 different quantitative and qualitative measures. We show evidence that in several situations a hybrid dense-sparse mixture model achieves the best results.\n\nIn the third set of experiments we pit GPT-3 against various benchmarks that aim to measure a models ability to reason, use common sense, or use background knowledge. Across these tests we find that scaling up model size consistently improves performance; however, gains diminish with scale across all tasks, behaving similarly to previous findings with smaller scale models.\n\nIn the final set of experiments we investigate whether GPT-3's large scale provides new capabilities, or just provides more of the same capabilities displayed by GPT-2, we discuss effects on generalization, reasoning, and various forms of knowledge in detail. We also conduct several case studies within specialized domains such as SAT analogies, Engligh-as-a-second-language reading comprehension, and trivia. We further explore GPT-3's limitations around its “world model”: its representations of common sense, factualness, and bias.\n\n2 Our Approach\n\nOur basic pre-training approach, including models, data, and training, is similar to the process described in [Rad19], which heavily relies on scaling including the data and training time and naturally extends length of training. Our core training involves a large amount of data, most notably the entirety of the Common Crawl – a dataset that for the purposes of this work we estimate to be on the order of 45TB in size [RCF+19], but we also train on a plethora of different settings that we believe are crucial for GPT-3's efficacy, each building on work we systematically explore different settings for significant model training characteristics like model size especially scaling up accordingly as the raw’}, ‘finish_details’: {‘type’: ‘max_tokens’}, ‘index’: 0}]}


Example Python without the openai library
import base64
import requests
import os

# Path to your image
image_path = "./ocr.png"

# Function to encode the image
def encode_image(image_path):
  with open(image_path, "rb") as image_file:
    return base64.b64encode(image_file.read()).decode('utf-8')

# Getting the base64 string
base64_image = encode_image(image_path)

headers = {
  "Content-Type": "application/json",
  "Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY')}"
}

# Undocumented content format: a bare prompt string followed by an
# image/resize dict, instead of the documented type/image_url objects
payload = {
  "model": "gpt-4-vision-preview",
  "max_tokens": 500,
  "messages": [
    {
      "role": "user",
      "content": [
        "Provide a full transcription of the image text",
        {
          "image": base64_image,
          "resize": 768
        }
      ]
    }
  ]
}


response = requests.post("https://api.openai.com/v1/chat/completions", headers=headers, json=payload)

if response.status_code != 200:
    print(f"HTTP error {response.status_code}: {response.text}")
else:
    print(response.json())
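
To compare runs I usually swap the plain print in the else branch for something that pulls out just the usage numbers and the message text:

    data = response.json()
    usage = data.get("usage", {})
    print("prompt_tokens:", usage.get("prompt_tokens"),
          "| completion_tokens:", usage.get("completion_tokens"))
    print(data["choices"][0]["message"]["content"])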

(I also tested the method from the March developer livestream, only to be billed for 91k input tokens and told "It appears to be a string of text rather than an actual image", which made that one experiment cost about $1.)