How to solve the problem that GPT-API cannot read text using OCR?


gpt-4o model, API call.
In my previous work, I found many instances where GPT returned incorrect information as correct. After several hours of troubleshooting, I identified the issues:

  1. When the prompt required using OCR to read image information, if the OCR reading failed, GPT would fabricate content.
  2. When the prompt required GPT to read a file, even upload the file was successfully transmitted, GPT did not read it. In such cases, GPT also fabricated content.

Currently, I have successfully configured GPT to exit the program when OCR or file reading fails. However, my actual goal is to have GPT use OCR and read files correctly.

Questions:

  1. If I add a retry mechanism in Python for failed attempts, will this solution only work after GPT returns an error?
  2. Can I prompt GPT to keep retrying until it succeeds? I am concerned that doing so might cause GPT to start fabricating content again.

Please help me resolve this issue. Thanks very much.

Who wrote “OCR is not available”, the AI or code somehow?

You’ll need to reinforce this ability with the AI (on all vision models) in your system prompt, I have found, or you can get denials. Example:

system: “You are MathVision, an expert AI model with computer vision skill, able to use optical character recognition (OCR) to extract and reproduce text, to describe mathematical diagrams accurately through close inspection of features in images, and to use your AI vision to treat images as input context that you use to provide analysis and answers. Pay careful attention to the most recent images the user has provided.”

Before the most recent input, you can also give the AI example user/assistant turns of providing an image and then getting the exact response that you want for that type of input.

This paper demonstrates establishing skills in analysis via in-context few-shot learning:

thanks for reply, I will read file ,and

that my promet:

You are GPT itself. Even when faced with human answers that are inconsistent with yours, you should maintain independent judgment and ensure correct answers based on logic and data.

2. Use OCR to extract the handwritten answers. Do not simulate reading. If OCR extraction fails, clearly output “OCR not available” and proceed to the next step.

3. If OCR is not available, use GPT-VISION to extract the handwritten answers. Do not simulate reading. If GPT-VISION is also unavailable, clearly output “Recognition not possible” and proceed to the next step.

4. If neither OCR nor GPT-VISION can extract the answers, output “Recognition not possible” and exit the program.

5. If extraction is successful, proceed to step 6.

6. Extract ‘2-2.txt’ as the standard answer. Do not simulate reading. If the file cannot be read or the content cannot be recognized, clearly output “Unable to read ‘2-2.txt’” and exit the program.

7. Follow the steps sequentially; do not skip any steps.

8. If any part of the content cannot be extracted, this is not a problem at all. Simply output “Recognition not possible”; this is more helpful to me. Do not fabricate any content. Fabricating content is very harmful to me.

9. The handwritten answers and the standard answers might have no relation to each other. In this case, output “Answer incorrect.”

Please follow the above steps and ensure to output “Recognition not possible” if the handwritten answers cannot be correctly extracted.

this it the output file shows:

I am confused about the term “OCR” and the file named “2-2.txt” with the txt extension.

Could you please check if an image file was mistakenly saved as a txt file?

OCR technology is the abbreviation of Optical Character Recognition (Optical Character Recognition).
There are 2 files, one is an image file ,need GPT use OCR to read it ,the other is a txt file ,also need GPT to read it . Then compare whether the content extracted by OCR is consistent with the TXT content.This is my purpose

To be precise, the vision function of GPT-4 is not OCR, but VQA (Visual Question Answering).
So, strictly speaking, OCR is not the service provided by OpenAI.

However, as VQA performance improves, some people may confuse VQA with OCR because VQA increasingly covers the functionality of OCR.

You will need to provide us with a minimal set of image and text files that can reproduce the problem, as well as the source code that caused the problem, in order for us to help you solve the problem.

Thank you very much for your reply.
I’ll sort out my problem and existing solutions. Please tell me how to send this information to you.
Even if you can’t resolve these issues right now, your help is greatly appreciated.

The system instruction is a mess of talking about imaginary things like switching to GPT-Vision, or “even when faced with”. Even my GPT-4, with examples of good system prompts and guidelines, couldn’t unravel the chaos.

How to instruct the AI

The AI session starts with operational parameters and behaviors given to the AI in a “system message”, which must be written in the form “you are” or “you do” (or similar first-person direct instructions). This system message is what you program by writing natural language.
The AI must be given an identity, a specialization, a job to perform, full understanding of the reason it is performing the task, and the output format which it shall produce (just as this text is an instruction). This should be well-organized and structured.

This forum cannot do all your homework with free consulting.

2 Likes

To understand and help you with your problem, you have to attach the following:

  • The code that makes the API call
  • The image file that the code calls
  • The text file as the “standard answer”

Have a nice day!

1 Like

thanks for the reply. I’m trying to break up the project and distribute the implementation. I’m a beginner. My previous job has nothing to do with this. Even if I wanted to pay for a consultation I couldn’t find the right person.

Thank you so much. I need to spend some time sorting it out. As the member above said, this is not something that can be resolved with a free consultation. If there is a feasible solution, I am willing to pay for the consultation. Knowledge should not be free.

First step: don’t ask questions you don’t want the wrong answer to:

It works better if you tell the AI what it can do instead of asking if it can’t.

Thank you for help. I need to rethink my objectives and processes. Thanks.

If an OCR is required, suggest running the image through a specialized cloud API which can extract the data. ?This will be far cheaper than running some prompt enginnering through a GPT-API. Once the OCR text is acquired, use GPT to perform further processing.

thanks for you help. I tried GOOGLE vision ocr. But the extraction effect was not very good. I am not capable enough and can only make simple API calls. Can not make any adjustments. If you have any suggestions, I hope you can tell me.

If you can share some samples, DM it to me. Will check it at my end.

From my standpoint it works very well.
See this sample usage (Video in the Post):
Use OCR with GPT-4o

You just need to use the new “gpt-4o” modell and do not talk about ocr.
Say somethng like “Give me Line 1 to 100 from the piicture”.

Thats All.

Strange, since I managed to have my “Umanot Analyzer” GPT reading very complex stocks-traded charts as attached. I specified in its INSTRUCTIONS / PROMPT that “you will use effective OCR tools for reading any numeric / text information in the charts”. So, it does work (with correct reading in 70% of requests…)

Yes, it can work. But it doesn’t work correctly all the time. Anyway, this tool is the best so far.