OpenAI API OCR isn't as successful as ChatGPT

Hello,
I have been trying to use the OpenAI API to extract text data out of construction plans, specifically numbers. I have tried uploading the picture directly, as well as editing it to remove as much irrelevant data as possible, but the response I get from the API is partial at best and most of the time doesn’t work at all. However, when I load the same picture into ChatGPT, I get almost perfect results. How do I get the API to be as successful as ChatGPT when it comes to OCR, or other aspects for that matter?


Interesting question.

Are you using ChatGPT without history? I would try it in a temporary chat to see if you get the same results.

Pat

Hi Omer, I’m familiar with the discrepancy you mentioned. For instance, when I upload a picture of a grocery receipt to ChatGPT in the browser, it is able to extract the information perfectly, assuming it uses its vision capabilities and doesn’t try to use code execution to OCR the image. However, if I upload the same content as a PDF, it bypasses the vision capabilities of the model and instead gets OCR’d in the backend, and the OCR results are passed to the model for interpretation. This will invariably produce a lower quality extraction of the text.

I would suggest ensuring that you pass the construction plans as an image (.png works well in my experience) in the API, using whatever encoding and handling is recommended for images by the API docs these days. I would also check to make sure that you have code execution disabled, and that in the prompt you explicitly tell the model to use its vision capabilities to interpret the image. If code execution is enabled, the model may be defaulting to writing and executing python code that performs OCR on the image you uploaded, and then analyzing the output of that code. This will be much worse than if the model simply “looks at” the image you shared.
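
In case it helps, here’s a minimal sketch of what that looks like with the Python SDK, assuming a local construction-plan.png and gpt-4o as the model (both are just placeholders for whatever you actually use):

```python
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Base64-encode the plan so it is sent as an image the model can "look at" directly
with open("construction-plan.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder; use whichever vision-capable model you are on
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Use your vision capabilities to read the dimension values "
                            "in this construction plan. Do not OCR it with code.",
                },
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:image/png;base64,{image_b64}",
                        "detail": "high",  # ask for the high-resolution tiling path
                    },
                },
            ],
        }
    ],
)

print(response.choices[0].message.content)
```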

Hope this helps! 🙂

The API has quite a few settings, including image quality, your prompt, and the model you use, that can all affect your outcome. Maybe you can share a few more details? Possibly an example picture and the parameters you use to make the API call? That way we can give it a try as well. I do a lot of extraction / conversion work, some with OpenAI and some with third-party tools (LlamaParse) through their API, so I’m always interested in different approaches and outcomes!


I’m uploading the rotated image I sent through the API. I have also done a bit of editing to the contrast, but I don’t think it helped very much.

I’ve sent this picture to a fresh chat without history and got GREAT recognition results; however, when sending it through the API I sometimes get random numbers like 17 or 4, and sometimes a generic response like:
“It seems that the text extraction did not yield any results. This could be due to several reasons, such as the image quality, the orientation of the text, or the presence of distortions. Let’s try a different approach by processing the image to enhance text recognition.”

I will try to alter the prompt based on ariessunfeld’s suggestion and see if it helps

edit:

with this prompt:
"You are looking at a construction plan image (JPG format). Do not attempt to perform OCR or use code to analyze this image. Instead, use your native vision capabilities to directly interpret the drawing.

Your task is to extract all dimension values (in centimeters) that are clearly visible in the image. These can be either vertical or horizontal, and may include:

Long dimensions (e.g., wall length)

Section widths or offsets

Sloped lengths

Segment distances between slashes

Also identify any text labels that accompany these dimensions (e.g., “תפר” which means “joint”).

Important rules:

Do not use code execution or OCR. Use vision directly.

If a dimension has start and end markers (diagonal slashes), note that explicitly.

Ignore annotations that are not dimension-related.
Output format:

Return a simple list of the dimension values you find, separated by commas:

[number1, number2, number3, …]

Do not include any extra text, explanations, or formatting."

and the results were a bit disappointing; it seems to be a sort of hallucination:
[300, 450, 120, 200, 150, 500, 75, 250, 100, 350]

Would playing with the temperature help? I set it to a relatively low value to boost accuracy.


When you say ‘through the API’, what model are you using? Can you share the actual full-size image too, so we can play with it as well?


I was curious and put it into the Playground; it got these results:
264, 234, 30, 278, 256, 206, 10, 1968, 800

At first I thought it was because the Playground can’t force a high resolution, so I tried the API directly, but it didn’t change much.
264, 234, 30, 607.8, 256, 206, 10, 19.6, 10, 800, 5320

I then ran it on ChatGPT 4o (web) and it also didn’t respond well, although it improved:
[264, 234, 30, 871.9, 256, 206, 10, 196.1, 800, 5320]

Then I ran ChatGPT o4-mini, and this one seems to be perfect (I’m not an engineer, so I might be wrong):
264, 234, 30, 87.09, 89.61, 256, 206, 10, 10, 800, 5320

What did it do differently? Apparently the code interpreter cropped small parts to zoom in, and re-ran several times:

I tried running it on the API with o4-mini-high:
264, 234, 30, 6078, 256, 206, 10, 10, 198, 800, 5320

And o3:
264, 234, 30, 6012, 1968, 256, 206, 10, 800, 5320

My conclusion so far is that ChatGPT (web)’s advantage is the code interpreter, which runs Python to manipulate the images for better results.

Currently, it is not supported by either the Completions or the Responses API.


Thank you for the detailed response!

What would you suggest I do to move forward? I assume I could run Python to do the initial manipulation before sending the image to the API, which I did to some extent.
How do I know what manipulation ChatGPT does, so I can mimic it as accurately as possible?
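
For reference, the kind of initial manipulation I mean looks roughly like this with Pillow (the contrast factor and the 2x upscale are just guesses I have been tuning):

```python
from PIL import Image, ImageEnhance, ImageOps

def preprocess_plan(path: str, out_path: str) -> None:
    """Rough pre-processing of a plan image before sending it to the API."""
    img = Image.open(path)
    img = ImageOps.exif_transpose(img)              # apply any stored rotation
    img = img.convert("L")                          # greyscale so thin dimension text stands out
    img = ImageEnhance.Contrast(img).enhance(1.8)   # assumed contrast factor, tune per drawing
    # Upscale so small dimension labels survive the model's own downscaling
    img = img.resize((img.width * 2, img.height * 2), Image.Resampling.LANCZOS)
    img.save(out_path, format="PNG")

preprocess_plan("construction-plan.jpg", "construction-plan-prepped.png")
```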

We don’t know and can’t know. There is chatgpt-4o-latest, which could let you see whether it is about the API’s encoding of images or about the model itself.

gpt-4-vision-preview had an undocumented method to pass images at larger sizes than 512px without the tiling billing, and it resulted in seeing what could not be seen for the same 85 tokens.

Consider that any image sent to a “tiling” model will be downsized so the shortest dimension is at most 768 pixels. A page at XXX x 768 is not great. That is what is happening with a PDF-based document like the one shown here.

The technique I would use if you really want to pay for perception:

Render the page and slice it into 2048x512 strips from the top down, with some overlap (2048x768 doubles your cost). Intersperse the image strips with text along the lines of “page 1, slice 1”.
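
A rough sketch of that slicing step with Pillow (strip height, overlap, and file names are all assumptions to adjust):

```python
from PIL import Image

def slice_page(path: str, strip_width: int = 2048, strip_height: int = 512, overlap: int = 64):
    """Cut a rendered page into overlapping horizontal strips, top down."""
    page = Image.open(path)
    # Scale so the page width matches the target strip width
    scale = strip_width / page.width
    page = page.resize((strip_width, round(page.height * scale)), Image.Resampling.LANCZOS)

    strips, top = [], 0
    while top < page.height:
        bottom = min(top + strip_height, page.height)
        strips.append(page.crop((0, top, strip_width, bottom)))
        if bottom == page.height:
            break
        top = bottom - overlap  # overlap so dimensions on a seam are not cut in half
    return strips

# Each strip is then sent as its own image part, interleaved with a text part
# such as "page 1, slice 3" so the model can keep track of where it is.
for i, strip in enumerate(slice_page("plan-page-1.png"), start=1):
    strip.save(f"page1-slice{i}.png")
```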

gpt-4.1-mini or o4-mini have a different resize algorithm that allows more variation in the size received by the model. Here is the optimum resize for a letter-page aspect ratio:

[image: optimum resize for a letter-page aspect ratio]

I’ve tried using Assistants, which have the code interpreter, but newer models aren’t supported and the ones available didn’t respond well.

If you put it into ChatGPT web (o4-mini or o3), click to see its thinking details while it is running. Code interpreter use shows up as Python code where the model is processing something; then it disappears from the logs, only visible while it is thinking. Later it becomes only that piece of log that I posted earlier.

At this moment I don’t have any concrete ideas. You could make a function that allows the model to request a zoom into a certain area, putting some colors or labels on the image to identify the area to be zoomed, but that would be extremely troublesome and inferior to the web version. Still, it is a possibility if you really need a solution; a rough sketch of the idea is below.
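
Purely as an illustration of that idea, a zoom tool for function calling could look something like this (the name zoom_region, its parameters, and the A1..D4 grid labels are all made up for the example):

```python
from PIL import Image

# Hypothetical tool the model could call to ask for a closer look at one area
zoom_tool = {
    "type": "function",
    "function": {
        "name": "zoom_region",
        "description": "Return a cropped, enlarged view of one region of the plan. "
                       "Regions are labelled A1..D4 on a 4x4 grid drawn over the image.",
        "parameters": {
            "type": "object",
            "properties": {
                "region": {"type": "string", "description": "Grid cell to zoom into, e.g. 'B3'."}
            },
            "required": ["region"],
        },
    },
}

def crop_region(path: str, region: str, grid: int = 4) -> Image.Image:
    """Crop the requested grid cell and upscale it 2x before sending it back as a new image."""
    img = Image.open(path)
    col = ord(region[0].upper()) - ord("A")  # 'A'..'D' -> column 0..3
    row = int(region[1:]) - 1                # '1'..'4' -> row 0..3
    w, h = img.width // grid, img.height // grid
    cell = img.crop((col * w, row * h, (col + 1) * w, (row + 1) * h))
    return cell.resize((cell.width * 2, cell.height * 2), Image.Resampling.LANCZOS)
```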

My best hope is that we get improvements in the Responses API that take us closer to the web version.

Hi,

I have been working with an OCR tool. Happy to share that I have had considerable success in my project. I would be happy to collaborate with you and try to refine your effort.

Cheers.