For me, the following code produces nonsensical results because of the blank page image.
Am I doing something wrong, is this just expected token-completion behavior (i.e. hallucination), or is there something embedded in the seemingly blank image?
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Please transcribe the image accurately into markdown text with appropriate headings. Also, provide alternative text for images if any."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://i.imgur.com/3Ohm46U.png",
                    },
                },
            ],
        }
    ],
    max_tokens=300,
)
print(response.choices[0])
Yes, this is expected behaviour when you supply a blank image, or an image whose resolution is so poor that its contents cannot be detected.
You can build a control into your prompt by including an additional instruction along the following lines (tailor further as required): If the image is blank, i.e. contains no visual content, please return 'blank image' as your response.
This reduces the chance of the model returning a hallucinated response.
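For instance, here is a minimal sketch of that adjustment, identical to the original request apart from the added instruction (the exact wording of the fallback text is an assumption you can tailor):

from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                # Added control: tell the model exactly what to return for a blank image.
                {"type": "text", "text": "Please transcribe the image accurately into markdown text with appropriate headings. If the image is blank, i.e. contains no visual content, return 'blank image' as your response."},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://i.imgur.com/3Ohm46U.png"},
                },
            ],
        }
    ],
    max_tokens=300,
)
print(response.choices[0])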
You can also ask the model to transcribe only if the image contains text.
See the example below:
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "If the image has text, please transcribe the image accurately into markdown text with appropriate headings, else reply with 'no text detected'. Also, provide alternative text for images if any."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://i.imgur.com/3Ohm46U.png",
                    },
                },
            ],
        }
    ],
    max_tokens=300,
)
print(response.choices[0])
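Alternatively, you can detect a blank image locally before spending tokens on an API call. A minimal sketch, assuming Pillow and requests are installed; the helper name and threshold are illustrative, not part of the original post:

import requests
from io import BytesIO
from PIL import Image

def is_blank_image(url: str, threshold: int = 2) -> bool:
    """Heuristic: treat the image as blank if every pixel is (almost)
    the same shade. The threshold is an assumption you may need to
    tune for noisy scans or JPEG artefacts."""
    img = Image.open(BytesIO(requests.get(url).content)).convert("L")
    lo, hi = img.getextrema()  # min and max grayscale values in the image
    return (hi - lo) <= threshold

if is_blank_image("https://i.imgur.com/3Ohm46U.png"):
    print("blank image - skipping the API call")
else:
    ...  # proceed with client.chat.completions.create as above

A side benefit of this check is that it answers the original question directly: if the pixel range is non-zero, the "blank" PNG contains faint content the eye misses; if it is zero, any transcription the model produces is pure hallucination.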