I have a PDF that is all images. I want to use ChatGPT’s API to perform OCR on it and extract the text into a .json file and a .txt file. Is OCR only for customers who have taken out a subscription? What model should I use? If anyone knows any blogs, YouTube videos, or GitHub repos about this, please let me know.
I’ve been doing some experiments for a client project. To do that, you need a workflow like this:
- Extract each page of the PDF as a single image. You can use online tools or Python libraries.
- Check the image resolution: for high-quality scans I’ve seen good results at 2000 pixels of height. For a manual with low-quality images, I had to resize them to 3500 pixels of height.
- Process each image with the API. The vision API will extract the text and describe illustrations (something a normal OCR can’t do!).
- Save the resulting response to a single text file.
- You can also ask for structured output in JSON format, save it, and then use Python to save a text version.
- Loop the process for each image.
- And that’s it.
It’s not quick: about 50 pages took 12 minutes, but the result was excellent. The cost was about $1.
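A minimal sketch of that workflow, assuming PyMuPDF (`pip install pymupdf`) for page rendering and the official `openai` package for the vision call; the prompt text, model name, and target height are illustrative choices, not a fixed recipe:

```python
import base64

def zoom_for_height(page_height_px, target_height_px):
    """Zoom factor so a rendered page comes out at the target height."""
    return target_height_px / page_height_px

def pdf_to_page_pngs(pdf_path, target_height=2000):
    """Render each PDF page as PNG bytes at roughly target_height pixels tall."""
    import fitz  # PyMuPDF
    pages = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            zoom = zoom_for_height(page.rect.height, target_height)
            pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))
            pages.append(pix.tobytes("png"))
    return pages

def ocr_page(client, png_bytes, model="gpt-4o-mini"):
    """One vision request per page image, sent as a base64 data URL."""
    b64 = base64.b64encode(png_bytes).decode()
    resp = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Transcribe all text on this page and describe any illustrations."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content
```

Looping over `pdf_to_page_pngs(...)` and appending each `ocr_page(...)` result to one file gives the single text output described above.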
So, a somewhat related question: how well does vision handle handwritten text?
It depends.
- Grocery shopping lists: pretty well
- Handwritten notes on invoices: usually well
- Doctor scribbles: pretty bad
Which model did you use to reach $1 for 50 pages?
I’ve been trying and testing lower resolutions with the GPT-4o-mini model, and the output is pretty accurate at a very low price.
At low resolution, the input (the image of the page) was around 7,000–9,000 tokens, and the output much lower than the input; let’s say an average of ~1,000 output tokens per page, depending on how much detail you want.
Higher resolutions seem to be in the range of 20,000 to 35,000 input tokens per request.
With GPT-4o-mini:
- Lower-resolution average price per page: < $0.01
- Higher-resolution average price per page: $0.01
- Lower-resolution average price per 50 pages: $0.09
- Higher-resolution average price per 50 pages: $0.25
Am I missing something?
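The arithmetic behind those per-page figures, assuming the published gpt-4o-mini rates of $0.15 per 1M input tokens and $0.60 per 1M output tokens (check the live pricing page; these rates are my assumption), with the token counts from the post above:

```python
# Assumed gpt-4o-mini rates, USD per token.
MINI_INPUT_PER_TOKEN = 0.15 / 1_000_000
MINI_OUTPUT_PER_TOKEN = 0.60 / 1_000_000

def page_cost(input_tokens, output_tokens):
    """Cost of one page-sized vision request at the assumed rates."""
    return (input_tokens * MINI_INPUT_PER_TOKEN
            + output_tokens * MINI_OUTPUT_PER_TOKEN)

low_res = page_cost(8_000, 1_000)    # ~8k input tokens at low resolution
high_res = page_cost(27_500, 1_000)  # midpoint of the 20k-35k high-res range
```

At these rates, 50 low-resolution pages come out around $0.09 and 50 high-resolution pages around $0.24, in line with the figures above.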
First, why use GPT-4o-mini? They priced it so it is actually more expensive for images than the latest GPT-4o.
Then, read the docs about internal image resizing.
- Low gets you a maximum dimension of 512px
- High gets you a maximum for the smaller side of 768px
So if you want high quality from a PDF sent the documented way, you would either send wide slices of 1536×512 (paying for three tiles of high intelligibility) and continue in later requests with vertical overlaps and a “continue from” prompt, or do your own custom slicing with overlaps at “low” detail, targeting 512px wide.
Something sent at 3000×4000 is seen at 768×1024, or maybe just a bit more if the ratio isn’t exact, which bumps up the tile count.
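A sketch of the “high” detail token count as I read the docs (scale to fit 2048×2048, then shortest side to 768px, then 512px tiles at 170 tokens each plus an 85-token base); treat the constants as my reading, not gospel:

```python
import math

def high_detail_image_tokens(width, height):
    """Billed image tokens at 'high' detail per the documented resizing rules."""
    # 1. Scale down to fit within a 2048x2048 square.
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # 2. Scale down so the shortest side is at most 768px.
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    # 3. Count 512px tiles: 170 tokens each, plus a fixed 85-token base.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles
```

For example, a 3000×4000 page lands at 768×1024: 4 tiles, so 765 billed image tokens (prompt text is extra).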
Are you sure about this? Even though GPT-4o-mini uses many more input tokens than GPT-4o for images, the final cost per 100 requests seems to be lower with GPT-4o-mini.
I uploaded a 1653 × 2339 image to both of the models. The results:
- GPT-4o: 1,113 input tokens, 725 output tokens.
- GPT-4o-mini: 36.8k input tokens, 742 output tokens.
Based on these values, GPT-4o would cost $2.06 per 100 requests, compared to $0.60 for GPT-4o-mini.
Are there other parameters they use to price the vision?
OpenAI massively cranks up the billed tokens artificially to ensure there is no bargain vision.
gpt-4o: 774 input tokens
gpt-4o-mini: 25510 input tokens
However, this also includes non-vision text in your case. Put the actual fixed cost of 85 tokens for “low”, plus 170 tokens per tile, through the per-token price.
The amplification can be discovered on the API pricing page’s image calculator; you are forced to make the discovery yourself. It produces the same price as gpt-4o-2024-05-13 and double the price of the cheaper gpt-4o-2024-08-06.
It’s also tricky that they force you to use a different calculator for mini instead of letting you quickly switch:
Mini actually provides the more satisfying answer, instead of “look it up yourself, pal”, though you should have less confidence in the smaller AI model actually knowing.
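Putting numbers to the parity claim. The prices and mini’s per-tile counts are my assumptions from the pricing page: gpt-4o-2024-05-13 at $5.00/1M input, gpt-4o-2024-08-06 at $2.50/1M, gpt-4o-mini at $0.15/1M, with mini billing roughly 33.3× the image tokens (2833 base + 5667 per tile):

```python
# Assumed per-token input prices (USD); check the live pricing page.
GPT4O_0513 = 5.00 / 1e6
GPT4O_0806 = 2.50 / 1e6
MINI = 0.15 / 1e6

# Image tokens for a 4-tile "high" detail image.
gpt4o_tokens = 85 + 4 * 170    # 765
mini_tokens = 2833 + 4 * 5667  # 25501 -- the ~33.3x amplification

gpt4o_cost = gpt4o_tokens * GPT4O_0513
mini_cost = mini_tokens * MINI
```

Both land at about $0.0038 per image, so mini’s image price tracks gpt-4o-2024-05-13 almost exactly and is roughly double gpt-4o-2024-08-06; the 25,501 figure also lines up with the 25,510 input tokens reported above once prompt text is included.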
Oh, I see now! I did the calculation with the API and now everything makes more sense.
Thanks for the explanation!
Hi RiavvioAs, could you send me the part of the code that checks the image resolution and loops over each image? If that’s OK, please post it here.