Why does ChatGPT perform better than the API for OCR tasks?

Hello everyone,
So I have some PDFs containing tables that I want to extract, and those tables are hard to extract with libraries like pypdf. So I thought about using OCR: I tried ChatGPT (4o) and it performed well. Then I tried to do the same thing through the API:

import base64
from io import BytesIO

from openai import OpenAI
from pdf2image import convert_from_path

client = OpenAI()

def encode_image(pil_image):
    # Serialize the PIL image to an in-memory JPEG and base64-encode it
    buffered = BytesIO()
    pil_image.save(buffered, format="JPEG")
    return base64.b64encode(buffered.getvalue()).decode("utf-8")

def gpt_extract_text_from_image(image):
    base64_image = encode_image(image)

    # Send the prompt and the base64-encoded image in one Responses API call
    response = client.responses.create(
        model="gpt-4o",
        input=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "input_text",
                        "text": (
                            "Extract the full table from this image with all data, including headers.\n"
                            "You are free to format the output as you see fit for clarity. Just make sure all data is readable and aligned correctly."
                        ),
                    },

                    {
                        "type": "input_image",
                        "image_url": f"data:image/jpeg;base64,{base64_image}",
                    },
                ],
            }
        ],
    )
    return response.output_text

pdf_path = "mypdf.pdf"
images = convert_from_path(pdf_path, dpi=700)  # one PIL image per page
first_image = images[0]  # pages are zero-indexed, so [0] is the first page

extracted_text = gpt_extract_text_from_image(first_image)

print(extracted_text)

Except that the output contained a lot of errors. I tried increasing the resolution (I used a DPI of 1200, which produces a very high-resolution image), but that didn't fix it. I don't understand the difference between ChatGPT and the API. Can you suggest any other alternatives?

Understand the limitations of vision:

The image will be downscaled so that its shortest side is at most 768 pixels.

That leaves a full page with a resolution along the lines of 768x990, i.e. under 90 DPI (768 pixels across an 8.5-inch page width), no matter how high a DPI you render at.
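
To make that concrete, here is a rough sketch of the documented resize rules (high-detail vision first fits the image within a 2048x2048 square, then scales it so the shortest side is 768 pixels; the exact rounding is my assumption):

def effective_size(width, height):
    # Fit within a 2048x2048 square, preserving aspect ratio
    scale = min(1.0, 2048 / max(width, height))
    width, height = width * scale, height * scale
    # Then scale so the shortest side is at most 768 px
    scale = min(1.0, 768 / min(width, height))
    return int(width * scale), int(height * scale)

# A US Letter page (8.5 x 11 in) rendered at 700 DPI is 5950 x 7700 px,
# but arrives at the model around 768 x 993 px, i.e. roughly 90 DPI:
print(effective_size(5950, 7700))  # (768, 993)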

You may have better results slicing pages. Send pieces with a bit of overlap that are around 1024x512, wider than tall, or even up to 1536x512, matching the resolution of the underlying tiling. Then also interleave "input_text" parts along the lines of "page 2, section 3" so the model understands where each piece comes from; a sketch of the idea follows below.
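
For example, here is a minimal sketch of that slicing approach, reusing encode_image from your script (the strip height, overlap amount, and label wording are illustrative choices, not fixed requirements):

def slice_page(page_image, strip_height=512, overlap=64, strip_width=1536):
    # Downscale the page so its width matches the target strip width
    scale = strip_width / page_image.width
    page = page_image.resize((strip_width, int(page_image.height * scale)))

    strips, top = [], 0
    while top < page.height:
        bottom = min(top + strip_height, page.height)
        strips.append(page.crop((0, top, page.width, bottom)))
        top += strip_height - overlap  # step forward, keeping some overlap
    return strips

# Build a single message that interleaves a text label with each strip
content = []
for i, strip in enumerate(slice_page(first_image)):
    content.append({"type": "input_text", "text": f"page 1, slice {i + 1}"})
    content.append({
        "type": "input_image",
        "image_url": f"data:image/jpeg;base64,{encode_image(strip)}",
    })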

Isolating tables so that each appears in a single image is difficult going in blind, and the AI has vision limitations when unraveling tables, keys, and legends.


Hello,
Thank you so much for your remark; I did not know the image gets resized automatically when its dimensions are too big.
For each document page, I took multiple overlapping pictures and raised the DPI, then did a second pass with the LLM to clean the extracted content (removing the overlap), and I got a table 100% identical to the one in the PDF.
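
In case it is useful, here is a minimal sketch of what that second cleanup pass can look like, assuming the per-slice extractions are collected in a list (the function name and prompt wording are illustrative, not my exact prompt):

def merge_slices(slice_texts):
    # Join the per-slice extractions with an explicit separator
    joined = "\n\n--- next slice ---\n\n".join(slice_texts)
    response = client.responses.create(
        model="gpt-4o",
        input=[{
            "role": "user",
            "content": [{
                "type": "input_text",
                "text": (
                    "The following table fragments were extracted from "
                    "overlapping slices of the same PDF page. Merge them into "
                    "one table, removing rows duplicated by the overlap:\n\n"
                    + joined
                ),
            }],
        }],
    )
    return response.output_text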
