Using gpt-4o as OCR fills data with invented data

Hi community,

I am developing an assistant that acts as an OCR for electricity bills. I have tried setting the response format to json_object and even defining an entire json_schema.
What happens is that the AI makes up the data despite it being right there in the attached images, and it only corrects a value when I follow up with “Are you sure you are not missing any data?” or “The value XXX, are you sure that’s it?”

I’ve tried attaching images with json_object or json_schema, and I’ve tried choosing text as the response format and enabling file_search to upload a PDF and have it process the text, but everything ends the same way: to fulfill the given JSON it makes up most of the fields, or it can’t read all of the information.

My questions are: which model should I use to get the vision part right (I can’t find any gpt-4-vision), and should I use the PDF or the images (even though the bill has graphics, I’m only interested in the text)?

Thank you very much


I have a project in production that does something like what you describe. The problem you are having is that when OpenAI Assistants analyze a PDF file, they only recognize text that can be selected. That is, if the content you want to extract is inside an image, it will never be recognized. My recommendation is to first convert the PDF file to an image (at a good resolution), then make a chat completions request that reads all the content of the image, and configure a structured response with the JSON format you prefer. The model I recommend is gpt-4o-mini. If you read the documentation you will see that this model is capable of viewing image content.
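For reference, a minimal sketch of that approach, assuming the openai Python SDK and a page already rendered to an image (the file name, model choice, and prompt are just illustrative):

```python
# Send a rendered bill page to gpt-4o-mini and ask for JSON back.
import base64
from openai import OpenAI

client = OpenAI()

# Assumed file name for one page of the bill rendered as a PNG.
with open("bill_page.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o-mini",
    # json_object keeps the reply valid JSON; a json_schema is stricter if you have one.
    response_format={"type": "json_object"},
    messages=[
        {
            "role": "system",
            "content": "Extract only the bill data you can actually read and return JSON. Use null for anything missing.",
        },
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Extract the invoice fields from this page."},
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        },
    ],
)
print(response.choices[0].message.content)
```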


Most bills probably have a standard font. I don’t think you need AI for this; pytesseract could be a good choice in your case, as long as you pre-process the image enough.

You could try preprocessing the image with AI as well, and you might get better results.
(Thresholding the image is good practice when trying to extract text from higher-contrast documents.)

It might also be good practice to downscale the images to a smaller resolution - which sounds impractical at first, but most models actually work better with fewer pixels, as long as there aren’t too few.
This is more of a trial-and-error kind of thing.
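Something along these lines (assuming OpenCV and pytesseract are installed; the file name and scale factor are placeholders to tune):

```python
# Rough preprocessing sketch: grayscale, threshold, optional downscale, then OCR.
import cv2
import pytesseract

img = cv2.imread("bill_page.png")
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

# Otsu thresholding works well on high-contrast documents like printed bills.
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# Optional downscale - tune the factor by trial and error as described above.
binary = cv2.resize(binary, None, fx=0.75, fy=0.75, interpolation=cv2.INTER_AREA)

text = pytesseract.image_to_string(binary)
print(text)
```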

Good luck! :hugs:


I recommend converting the PDF to images (e.g. one PNG per page of the document) and processing those instead. That way you can guarantee the model will OCR any scanned text and won’t instead try to use the selectable text in the PDF, which may not include the information that you want to process.
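One way to do the conversion is with the pdf2image package (which needs poppler installed; the file names here are just placeholders):

```python
# Render each PDF page to its own PNG so the model has to OCR the pixels.
from pdf2image import convert_from_path

pages = convert_from_path("bill.pdf", dpi=300)  # higher dpi = sharper text
for i, page in enumerate(pages, start=1):
    page.save(f"bill_page_{i}.png", "PNG")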


Thanks for your reply, but using Completions from the Playground section shows a banner that says:

"Completion models are now legacy.

The Completions Playground will be removed from the Playground menu and only accessible by URL starting 1/14/25."

Should I use the Chat section to try it? I’m trying to build something with the Chat section, but I only get good results by copying and pasting the whole PDF text, not with the images. When I work with the images, it fills some data with invented or repeated values, and I only got the correct answer when I wrote “Are you sure about this?”


Thanks for your reply,
I’m trying to do exactly that: I have 3 pages in the PDF and I generate 3 images. Then in Chat I upload all the pages and ask something like “Give me all the data related to an electric bill based on the images, filling this JSON; if something is not clear just return an empty string like in the JSON below” (plus some explanation about the fields).

Some exact data, like the payment card number, is something like “CARD NUMBER: 1111 2222 33343 444”, but the JSON says “CARD NUMBER: 7777 8888 9999 0000”. And that data doesn’t even appear anywhere in the PDF.

If I use the Assistant section, sometimes it returns “Name: John Doe” and that kind of typical default data.

What I meant was to use the Chat option in the Playground. If you want to work with PDFs, I don’t think you can do that in this section; however, you can do it in the Assistant section. In that case, I’d say try using function_calling and structured outputs with an Assistant. For images, you could use the Chat section to get plain text from them.
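A rough sketch of an Assistant configured with strict structured outputs might look like this (the schema name and fields are invented for illustration; adapt them to your bill):

```python
# Create an Assistant that must answer with a fixed JSON shape.
from openai import OpenAI

client = OpenAI()

# Illustrative schema: nullable fields so missing data stays null instead of being invented.
bill_schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": ["string", "null"]},
        "total_amount": {"type": ["string", "null"]},
    },
    "required": ["invoice_number", "total_amount"],
    "additionalProperties": False,
}

assistant = client.beta.assistants.create(
    model="gpt-4o",
    instructions="Extract only data that is actually present in the bill; use null otherwise.",
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "electric_bill", "schema": bill_schema, "strict": True},
    },
)
```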


I’ve seen startups use YOLOv5 + an LLM vision model to get what they want

shoutout to OpenAI to bring a non-LLM, pure machine learning model (better than the “yolo GitHub repo” one out there) to the API


(lol, maybe it already uses this and it’s just non-public info)

Hi,

Please see the examples offered here to use the gpt-4o models for vision tasks: Introduction to GPT-4o and GPT-4o mini | OpenAI Cookbook


Hi again with some news:

I’m following the data_extraction_transformation guide from the Cookbook, and using it with Python code + Chat it works pretty well (with some misses). I copy-pasted the same prompt into the Playground, using the same images, and the Chat section did pretty well too, but if I create an Assistant with the same system prompt and the same user input and files, it always fills the data with invented values.

Even if I change the input images to electric bills, the Assistant always fills in the hotel data (the example given in the Cookbook), but in Chat the output changes correctly. I’m not understanding why…

Prompt:

You are an OCR-like data extraction tool that extracts hotel invoice data from PDFs.

    1. Please extract the data in this hotel invoice, grouping data according to theme/sub groups, and then output into JSON.

    2. Please keep the keys and values of the JSON in the original language.

    3. The type of data you might encounter in the invoice includes but is not limited to: hotel information, guest information, invoice information,
    room charges, taxes, and total charges etc.

    4. If the page contains no charge data, please output an empty JSON object and don't make up any data.

    5. If there are blank data fields in the invoice, please include them as "null" values in the JSON object.

    6. If there are tables in the invoice, capture all of the rows and columns in the JSON object.
    Even if a column is blank, include it as a key in the JSON object with a null value.

    7. If a row is blank denote missing fields with "null" values.

    8. Don't interpolate or make up data.

    9. Please maintain the table structure of the charges, i.e. capture all of the rows and columns in the JSON object.

Chat:

{
    "Invoice": {
        "Electricity bill amount": "303.73 €",
        "Electricity bill details": {
            "Invoice number": "TEST123456",
            "Invoice issue date": "November 19, 2024",
            "Billing period": "From 10/17/2024 to 11/14/2024",
            "Charge date/payment deadline": "11/26/2024"
        },
        "Invoice summary": {
            "Power": "45.20 €",
            "Energy": "152.72 €",
            "Charges": "29.66 €",
            "Electricity tax": "12.15 €",
            "Rental": "1.29 €",
            "Applied tax": "21.00 %",
            "Total invoice amount": "303.73 €"
        },
...

Assistant (with same system prompt):

[
    {
        "Hotel Information": {
            "Name": "Hamburg City (Zentrum)",
            "Address": "Willy-Brandt-Straße 21, 20457 Hamburg, Deutschland",
            "Phone": "+49 (0) 40 3039 379 0"
        },
        "Guest Information": {
            "Name": "APIMEISTER CONSULTING GmbH",
            "Guest": "Herr Jens Walter",
            "Address": "Friedrichstr. 123, 10117 Berlin"
        },
...

Are you following the cookbook all the way through to using JSON mode (a response_format of json_object), or even transforming it to a schema (response_format: json_schema plus a schema)?

If so, the AI will basically produce what the schema tells it to; it doesn’t have a way to refuse or adapt. In JSON mode, a model that doesn’t want to start the schema will give you loops of newlines and tabs.

The best way to counter this is an anyOf schema at your root, with two possible schemas: one for producing the JSON object for a hotel invoice, and one for refusal or generic extraction of OCR text.
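As a rough illustration of that idea (field names invented, and shown non-strict since strict mode restricts what the root of a schema can contain):

```python
# Two-branch schema: either real invoice data, or a fallback with the raw OCR text.
response_format = {
    "type": "json_schema",
    "json_schema": {
        "name": "invoice_or_fallback",
        "strict": False,
        "schema": {
            "anyOf": [
                {   # branch 1: invoice data actually found on the page
                    "type": "object",
                    "properties": {
                        "invoice_number": {"type": ["string", "null"]},
                        "total_amount": {"type": ["string", "null"]},
                    },
                    "required": ["invoice_number", "total_amount"],
                    "additionalProperties": False,
                },
                {   # branch 2: refusal / generic text when no invoice data exists
                    "type": "object",
                    "properties": {"extracted_text": {"type": "string"}},
                    "required": ["extracted_text"],
                    "additionalProperties": False,
                },
            ]
        },
    },
}
```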

I’m not quite sure what difference Assistants would make, as they support both kinds of structured output: JSON mode (you just describing the format) or passing a schema (strict structured outputs). If you are going to parse the output programmatically, the AI making up its own object shape to suit the data would not be a good idea for the downstream database usage. Perhaps the “chat” code didn’t receive a schema at all? (The cookbook uses a file.)