Here is self-documenting code. I present it notebook-style, just to keep you busy copy-pasting.
Use the Python SDK’s “client” object, and start with a system role message:
from openai import OpenAI

client = OpenAI()

system_message = [
    {
        "role": "system",
        "content": [
            {
                "type": "text",
                "text": "You are ImageAI, with built in computer vision."
            }
        ]
    }
]
I’ll give you example base64 images so you can run this immediately.
# shared PNG header prefix, common to all three example images
pngpre = 'iVBORw0KGgoAAAANSUhEUgAAAIAAAABACAMAAADlCI9NAAAAGXRFWHRTb2Z0d2FyZQBBZG9iZSBJbWFnZVJlYWR5ccllPAAAAAZQTFRF////'

example_images = [
    'MzMzOFSMkQAAAPJJREFUeNrslm0PwjAIhHv//09rYqZADzOBqMnu+WLTruOGvK0lhBBCCPHH4E7x3pwAfFE4tX9lAUBVwZyAYjwFAeikgH3XYxn88nzKbIZly4/BluUlIG66RVXBcYd9TTQWN+1vWUEqIJQI5nqYP6scl84UqUtEoLNMjoqBzFYrt+IF1FOTfGsqIIlcgAbNZ0Uoxtu6igB+tyBgZhCgAZ8KyI46zYQF/LksQC0L3gigdQBhgGkXou1hF1XebKzKXBxaDsjCOu1Q/LA1U+Joelt/9d2QVm9MjmibO2mGTEy2ZyetsbdLgAQIIYQQQoifcRNgAIfGAzQQHmwIAAAAAElFTkSuQmCC',
    'AAAAVcLTfgAAAPRJREFUeNrsllEKwzAMQ+37X3owBm0c2VZCIYXpfXVBTd9qx5uZEEIIIcQr8IHjAgcc/LTBGwSiz5sEoIwTKwuxVCAW5XsxFco3Y63A3BawVWDMiFgiMD5tvELNuh/r5sA9Nu1yiYaXvBBLBawUAGubsZU5UOy8HkNvINoAv27nMVZ1WC1wfwrspPk2FDMiVpYknNu6uIxAVWQsgBoSCCQxI2KEANFdXccXseZzuKMQQDFmt6pPwU9CL+CcADEJr6qFA1aWYIgZEesGEVgmTsGvfYyIdaPYwp6JwBRL5kD4Hs7+VWGSz8aEEEIIIYQQ/8VHgAEAxPsD+SYeZ2QAAAAASUVORK5CYII=',
    'AAAAVcLTfgAAAPVJREFUeNrslsEOhCAMRNv//+nNbtYInRELoniYdyJC2hdsATMhhBBCiFfiG4vTT1XIx/LA0wJl0hUCIeU8g2QgSBiFelJOFoCq+I3+H8ox6aN8SeGK7QvW5XfghcA+B0WcFvBDgToWbEmVANvoigBO1AIGY6N9lKuBlgAsClJ0bLME2CKaB1Kx1RcEQmWxHfK7BFhpPyHAOus+AVxW9lG7BqYJ+IHAWRHajCKE+6/YgB6B4TaMBk4EPCPgwwIG5yfEOROIp3XvxU4fRO74UGr/d3J3pt837OqAm6cl0IrQ8zAcOacbERa+s4UQQgghhBBv5iPAAA3BAvjyKYgWAAAAAElFTkSuQmCC',
]

# prepend the shared header so each entry is a complete base64 PNG
example_images = [pngpre + s for s in example_images]
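As a quick sanity check (my addition, not part of the original), you can decode one of these strings and confirm it really is a PNG before spending any tokens:

import base64

png_bytes = base64.b64decode(example_images[0])
print(png_bytes[:4] == b"\x89PNG")  # True for a valid PNG header
with open("example1.png", "wb") as f:
    f.write(png_bytes)  # view the decoded image locally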
Construct a detailed multi-image user message, where each image is preceded by a text part carrying its metadata. This is where the challenge was.
user_tiled_image_message = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "Produce a per-image report of each image's contents."
            },
            {
                "type": "text",
                "text": "1. image filename example1.png:"
            },
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{example_images[0]}", "detail": "low"}
            },
            {
                "type": "text",
                "text": "2. image filename example2.png:"
            },
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{example_images[1]}", "detail": "high"}
            }
        ]
    }
]
Then send it off. (The two messages are each already in a list, so the lists can simply be “added” together.)
response = client.chat.completions.with_raw_response.create(
    model="gpt-4o-2024-08-06", max_tokens=500, top_p=0.01,
    messages=system_message + user_tiled_image_message,
)
print(response.http_response.json()["choices"][0]["message"]["content"])
print(response.http_response.json()["usage"])  # token usage, including image tokens
print(f"time: {response.elapsed.total_seconds():.2f}s")
This is documented in the API reference, but you’ve gotta expand the user message format, and expand, and expand…
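If you’re assembling many of these parts, a small helper keeps the expansion manageable. This is a sketch of my own (image_part is a hypothetical name, not an SDK function), assuming local PNG files:

import base64

def image_part(path: str, detail: str = "low") -> dict:
    """Wrap a local PNG file as a chat-completions image_url content part.

    detail can be "low", "high", or "auto" per the API reference.
    """
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {
        "type": "image_url",
        "image_url": {"url": f"data:image/png;base64,{b64}", "detail": detail},
    }

With that, each image entry in the user content collapses to image_part("example1.png").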
gpt-4-1106-vision-preview supports yet another undocumented and useful image method, where the image is neither tiled nor resized down (among other things that its API alone will accept).
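That untiled method itself isn’t shown here; the comparison below is simply the same request rerun with the model swapped. A sketch, reusing the messages from above:

response_vp = client.chat.completions.with_raw_response.create(
    model="gpt-4-1106-vision-preview", max_tokens=500, top_p=0.01,
    messages=system_message + user_tiled_image_message,
)
print(response_vp.http_response.json()["choices"][0]["message"]["content"])
print(f"time: {response_vp.elapsed.total_seconds():.2f}s")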
Quality difference between the models, given the same input:
=============== gpt-4-1106-vision-preview ===============
Image Content Report
1. Image Filename: example1.png
- Content Description: The image contains the word “Apple” in a simple, pixelated black font on a white background.
- Text Analysis: The text is clear and legible, styled in a basic sans-serif typeface.
2. Image Filename: example2.png
- Content Description: The image displays the word “Banana” in a pixelated black font on a white background.
- Text Analysis: The text is straightforward and readable, presented in a plain sans-serif font.
- Resolution: 64x128 pixels
Both images are text-based with no additional graphical elements, focusing solely on the representation of the words “Apple” and “Banana” respectively.
time: 6.43s
=============== gpt-4o-2024-08-06 ===============
Image Report

- Image Filename: example1.png
  - Contents: The image contains the text “Apple”.
- Image Filename: example2.png
  - Contents: The image contains the text “Banana”.
time: 3.14s