Moving from gpt-4-vision-preview to gpt-4o Image URL Base64

I am trying to convert over my API code from using gpt-4-vision-preview to gpt-4o. I am passing a base64 string in as image_url. It works no problem with the model set to gpt-4-vision-preview but changing just the model to gpt-4o gives an error that gpt-4o requires image_url to be a link to an image. But according to the documentation it should work with a base64 string. I have tried gpt-4o-mini also. It only seems to work with gpt-4-vision-preview. Is there something I should be doing differently with gpt-4o?

Image of documentation:

My code is:
const mediaHistory = [
  {
    role: "user",
    content: [
      {
        type: "text",
        text: frameInstructions
      },
      {
        type: "image_url",
        image_url: `data:image/jpeg;base64,${imageBase64}`
      }
    ]
  },
];

const messageHistory = [
  {
    role: "system",
    content: [
      {
        type: "text",
        text: instructions
      }
    ]
  },
  {
    role: "user",
    content: [
      {
        type: "text",
        text: chatText
      }
    ]
  }
];
//console.log(mediaHistory);
const opts = {
  model: "gpt-4o",
  max_tokens: 300,
  messages: [...mediaHistory, ...messageHistory]
};
const response = await openai.chat.completions.create(opts);

Correct Format for Base64 Images

The main issue developers face is using the correct structure when sending base64-encoded images to the API. The solution is to structure the image data as follows:
```json
{
  "type": "image_url",
  "image_url": {
    "url": "data:image/jpeg;base64,<base64_encoded_image_data>"
  }
}
```

Key points:
- Use `"type": "image_url"` instead of `"type": "image"`
- Pass `image_url` as an object with a `url` key, not as a bare string
- Include the full data URI scheme, including the MIME type (e.g., `data:image/jpeg;base64,`)
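In Python, building that structure from a local file can be sketched like this (the helper names and the file path are my own, not part of the SDK):

```python
import base64

def encode_image(path: str) -> str:
    # Read a local image file and return its base64-encoded contents.
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def image_content_part(b64: str, mime: str = "image/jpeg") -> dict:
    # Wrap the base64 data in the nested object the chat completions API expects:
    # image_url must be an object with a "url" key, not a bare string.
    return {
        "type": "image_url",
        "image_url": {"url": f"data:{mime};base64,{b64}"},
    }
```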

BUT … even with the correct structure, the encoding that gets sent is still frequently misinterpreted.

It makes me think about the importance of dev communities; even the web-based OpenAI community has nothing on this, which is ridiculous. Does anybody know which OpenAI community has the most traffic?

Here is self-documenting code, presented notebook-style so you can copy-paste each block in order.

Use the Python `client` API SDK method, and a system role message:

from openai import OpenAI
client = OpenAI()

system_message = [
  {
    "role": "system",
    "content": [
      {
        "type": "text",
        "text": "You are ImageAI, with built in computer vision."
      }
    ]
  }
]

I’ll give you example base64 images so you can run immediately.


pngpre = 'iVBORw0KGgoAAAANSUhEUgAAAIAAAABACAMAAADlCI9NAAAAGXRFWHRTb2Z0d2FyZQBBZG9iZSBJbWFnZVJlYWR5ccllPAAAAAZQTFRF////'
example_images = [
'MzMzOFSMkQAAAPJJREFUeNrslm0PwjAIhHv//09rYqZADzOBqMnu+WLTruOGvK0lhBBCCPHH4E7x3pwAfFE4tX9lAUBVwZyAYjwFAeikgH3XYxn88nzKbIZly4/BluUlIG66RVXBcYd9TTQWN+1vWUEqIJQI5nqYP6scl84UqUtEoLNMjoqBzFYrt+IF1FOTfGsqIIlcgAbNZ0Uoxtu6igB+tyBgZhCgAZ8KyI46zYQF/LksQC0L3gigdQBhgGkXou1hF1XebKzKXBxaDsjCOu1Q/LA1U+Joelt/9d2QVm9MjmibO2mGTEy2ZyetsbdLgAQIIYQQQoifcRNgAIfGAzQQHmwIAAAAAElFTkSuQmCC',
'AAAAVcLTfgAAAPRJREFUeNrsllEKwzAMQ+37X3owBm0c2VZCIYXpfXVBTd9qx5uZEEIIIcQr8IHjAgcc/LTBGwSiz5sEoIwTKwuxVCAW5XsxFco3Y63A3BawVWDMiFgiMD5tvELNuh/r5sA9Nu1yiYaXvBBLBawUAGubsZU5UOy8HkNvINoAv27nMVZ1WC1wfwrspPk2FDMiVpYknNu6uIxAVWQsgBoSCCQxI2KEANFdXccXseZzuKMQQDFmt6pPwU9CL+CcADEJr6qFA1aWYIgZEesGEVgmTsGvfYyIdaPYwp6JwBRL5kD4Hs7+VWGSz8aEEEIIIYQQ/8VHgAEAxPsD+SYeZ2QAAAAASUVORK5CYII=',
'AAAAVcLTfgAAAPVJREFUeNrslsEOhCAMRNv//+nNbtYInRELoniYdyJC2hdsATMhhBBCiFfiG4vTT1XIx/LA0wJl0hUCIeU8g2QgSBiFelJOFoCq+I3+H8ox6aN8SeGK7QvW5XfghcA+B0WcFvBDgToWbEmVANvoigBO1AIGY6N9lKuBlgAsClJ0bLME2CKaB1Kx1RcEQmWxHfK7BFhpPyHAOus+AVxW9lG7BqYJ+IHAWRHajCKE+6/YgB6B4TaMBk4EPCPgwwIG5yfEOROIp3XvxU4fRO74UGr/d3J3pt837OqAm6cl0IrQ8zAcOacbERa+s4UQQgghhBBv5iPAAA3BAvjyKYgWAAAAAElFTkSuQmCC',
]
example_images = [pngpre + s for s in example_images]
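If you want to confirm such strings really are images before sending them, here is a quick sanity check on the PNG magic bytes (my own helper, not part of the SDK):

```python
import base64

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # every PNG file starts with these 8 bytes

def looks_like_png(b64: str) -> bool:
    # Decode only the first 16 base64 characters (12 bytes) and check the magic.
    return base64.b64decode(b64[:16]).startswith(PNG_SIGNATURE)
```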

Construct a detailed multi-image user message, with a metadata description preceding each image. This is where the challenge lay.

user_tiled_image_message = [
  {
    "role": "user",
    "content": [
      {
        "type": "text",
        "text": "Produce a per-image report of each image's contents."
      },
      {
        "type": "text",
        "text": "1. image filename example1.png:"
      },
      {
        "type": "image_url",
        "image_url": {"url": f"data:image/png;base64,{example_images[0]}", "detail": "low"}
      },
      {
        "type": "text",
        "text": "2. image filename example2.png:"
      },
      {
        "type": "image_url",
        "image_url": {"url": f"data:image/png;base64,{example_images[1]}", "detail": "high"}
      }
    ]
  }
]

Then send it off (the two messages are already in lists, so the lists can simply be added):

response = client.chat.completions.with_raw_response.create(
  model="gpt-4o-2024-08-06", max_tokens=500, top_p=0.01,
  messages=system_message + user_tiled_image_message,
)
print(response.http_response.json()["choices"][0]["message"]["content"])
print(response.http_response.json()["usage"])
print(f"time: {response.elapsed.total_seconds():.2f}s")
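For reference, the usage dict that comes back carries the standard chat completions token counts; a small formatter (a hypothetical helper of mine, only the field names come from the API):

```python
def summarize_usage(usage: dict) -> str:
    # Format the token accounting returned with every chat completion response.
    return (
        f"prompt: {usage['prompt_tokens']}, "
        f"completion: {usage['completion_tokens']}, "
        f"total: {usage['total_tokens']}"
    )
```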

This is documented in the API reference, but you’ve got to expand the user message format, and expand, and expand…


gpt-4-1106-vision-preview supports yet another undocumented and useful image method, where an image is neither tiled nor resized down (among other things that its API alone will accept).


Quality difference of AI from same input

=============== gpt-4-1106-vision-preview ===============

Image Content Report

1. Image Filename: example1.png

  • Content Description: The image contains the word “Apple” in a simple, pixelated black font on a white background.
  • Text Analysis: The text is clear and legible, styled in a basic sans-serif typeface.

2. Image Filename: example2.png

  • Content Description: The image displays the word “Banana” in a pixelated black font on a white background.
  • Text Analysis: The text is straightforward and readable, presented in a plain sans-serif font.
  • Resolution: 64x128 pixels

Both images are text-based with no additional graphical elements, focusing solely on the representation of the words “Apple” and “Banana” respectively.
time: 6.43s

=============== gpt-4o-2024-08-06 ===============
Image Report

  1. Image Filename: example1.png

    • Contents: The image contains the text “Apple”.
  2. Image Filename: example2.png

    • Contents: The image contains the text “Banana”.
      time: 3.14s