gpt-4-vision-preview handwriting transcription producing nonsense

So, using ChatGPT with the GPT-4 model I've managed to get great results manually transcribing handwritten historic documents. I'd now like to scale up and automate this process, so to test it I implemented the following code, in line with the API examples:

import os
from openai import OpenAI
import base64
import mimetypes

client = OpenAI(api_key='apikey')  # placeholder; use your real API key

def image_to_base64(image_path):
    # Guess the MIME type of the image
    mime_type, _ = mimetypes.guess_type(image_path)
    
    if not mime_type or not mime_type.startswith('image'):
        raise ValueError("The file type is not recognized as an image")
    
    # Read the image binary data
    with open(image_path, 'rb') as image_file:
        encoded_string = base64.b64encode(image_file.read()).decode('utf-8')
    
    # Format the result with the appropriate prefix
    image_base64 = f"data:{mime_type};base64,{encoded_string}"
    
    return image_base64


def transcribe_image(image_path):

    base64_string = image_to_base64(image_path)
    # Make an API call to submit the image for transcription
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "Manually transcribe this handwriting"},
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": base64_string,
                            "detail": "low"
                        }
                    },
                ],
            }
        ],
        max_tokens=300,
    )

    # Print the raw API response (the transcription text is in
    # response.choices[0].message.content)
    print(response)

# Example usage
image_path = 'testimage.png'
transcribe_image(image_path)

and I used an example image of handwriting that ChatGPT had been able to extract the text from very well. However, when I submit it through the API call, the text returned doesn't even remotely resemble the text in the image. For example, the text in the image starts:

"A report of the By-law committee, and brought up by the Secty, was explained, that every brother would have a rough copy sent to him, before the next meeting "

and this is the response I get from the API:

ChatCompletionMessage(content='Sure, here is the transcription of the handwritten text:\n\nThe great orchard & lawn extended to the foot of a pretty precipitous hill, from one extremity of the house, north, it being rectangular, oblong, highly cultivated, and more proper of a rich soil, and when in bloom was indeed a charming sight. On its west side was the large vegetable garden, walled in, and to the west of that, still further, was a large pasture with some fine cattle, silky kine, which you know, I always liked and enjoyed, as well as a pretty piece of water, say pond, near the root of the hill, running to the N E or nearly so; - and in the pond, the beautiful White and other swans healthily, happily, quacking for the plentiful supply of excellent food given them, from which they were never distant, but when they chose to sail twenty yards to the opposite side, beautiful to look at;\nFrom the S E corner of the house, the view was enchanting, bounded by the mountain appropriate the Helderberg, but nearer to us the fruitful, highly cultivated beauteous farm, under high cultivation, and not the least agreeable, was the neat water sawmill with the small pond before the door and the larger one back of it; and still further east the splendid water flouring mill, when, if ever, happier days should again come to us, would have been for sale, whilst to', role='assistant', function_call=None, tool_calls=None))], created=1714403295, model='gpt-4-1106-vision-preview', object='chat.completion', system_fingerprint=None, usage=CompletionUsage(completion_tokens=300, prompt_tokens=98, total_tokens=398))

Am I somehow doing something incredibly stupid here, or am I missing the point? This is my first time using the OpenAI API, and I would really appreciate any advice. Thank you in advance!

This problem made me curious, so I just ran a couple of tests.

What I can confirm is that your code is fine and that the prompt, in principle, works too. I suspect it could be the quality of your image. In my tests I noticed that when I provided an image of some very simple handwriting of mine at the lowest resolution, the model started to struggle to recognize individual words. However, for the same image at a slightly higher resolution, it recognized the writing perfectly, using the same code and prompt as you do.

The nature of the response you are getting (i.e., the hallucination) reminds me a lot of the hallucinations I used to see when a model was accidentally given an empty string as input. So I suspect the model fails to recognize the writing in the picture and, as a result, makes up the content entirely.

When using detail: low, the maximum image dimension is 512 pixels; a resize is done automatically.

That can mean an image of 1920x1080 goes to the AI model as 512x288. There's no way a page can be read at that size.
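To make that concrete, here's the scaling as a quick sketch (my own illustration, assuming the resize simply caps the longest side at 512 and preserves the aspect ratio; this helper is not part of the API):

def low_detail_size(width, height, max_side=512):
    # Approximate dimensions after the automatic detail: low resize
    scale = min(1, max_side / max(width, height))
    return round(width * scale), round(height * scale)

print(low_detail_size(1920, 1080))  # (512, 288) - far too small for a full page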

Ask that same AI to write you a function, using the Pillow image library (PIL), that caps the maximum side of an image at a size of your choosing (default 1024); then at detail: high you'll get 4x4 tiled image recognition (at significantly higher, but still restrained, cost).
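For example, here's a minimal sketch of such a resize-and-encode helper (the 1024-pixel cap, the LANCZOS filter, and the JPEG re-encode are my choices, not anything the API mandates):

import base64
import io

from PIL import Image  # pip install Pillow

def image_to_base64_resized(image_path, max_side=1024):
    # Downscale so the longest side is at most max_side, keeping the aspect
    # ratio, then return a base64 data URL like the one in the original code.
    with Image.open(image_path) as img:
        scale = max_side / max(img.size)
        if scale < 1:  # only shrink, never enlarge
            new_size = (round(img.width * scale), round(img.height * scale))
            img = img.resize(new_size, Image.LANCZOS)
        buffer = io.BytesIO()
        img.convert("RGB").save(buffer, format="JPEG", quality=90)
    encoded = base64.b64encode(buffer.getvalue()).decode("utf-8")
    return f"data:image/jpeg;base64,{encoded}"

Drop that in place of image_to_base64() in the code above and set detail to "high".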

There is also an alternate user message format that only accepts base64 and does not resize, so you have to ensure a reasonable size yourself. It can see larger single-tile images, and the limitation is instead on how much context it can return (like text) before it hallucinates.

You can add @_j to a forum search and you might come across PIL-powered functions for sending images to the AI in that message format…

Thank you very much! I hadn't caught that using low would restrict the resolution that much. Switching it to high now delivers the same output as ChatGPT with GPT-4. I'll follow your suggestion and work out the minimum resolution that achieves good results at reasonable cost. I really appreciate the input; this was very helpful, thank you!
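For anyone who finds this thread later, the fix was a single change to the image part of the message in the original code (base64_string being the data URL built by image_to_base64()):

image_part = {
    "type": "image_url",
    "image_url": {
        "url": base64_string,  # data URL from image_to_base64()
        "detail": "high"       # was "low"; "high" enables tiled, readable input
    }
}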
