I have a PDF that is all images. I want to use ChatGPT’s API to perform OCR on it and extract the text into a .json file and a .txt file. Is OCR only for customers who have taken out a subscription? What model should I use? If anyone knows any blogs, YouTube videos, or GitHub repos about this, please let me know.
I’ve been doing some experiments for a client project. To do that, you need a workflow like this:
- Extract each page of the PDF as a single image. You can use online tools or Python libraries.
- Check the image resolution: for high-quality scans I’ve seen good results at 2000 pixels of height. For a manual with low-quality images, I had to resize them to 3500 pixels of height.
- Process each image with the API. The vision API will extract the text and describe illustrations (something a normal OCR can’t do!).
- Save the resulting response to a single text file.
- You can also ask for structured output in JSON format, save it, and then use Python to save a text version.
- Loop the process for each image.
- And that’s it.
It’s not quick: about 50 pages took 12 minutes, but the result was excellent. The cost was about $1.
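A minimal sketch of that workflow, assuming PyMuPDF (`pip install pymupdf`) for page rendering and the official `openai` package for the vision call; the prompt text, model name, and target height are illustrative choices, not a fixed recipe:

```python
import base64

def zoom_for_height(page_height_px, target_height_px):
    """Zoom factor so a rendered page comes out at the target height."""
    return target_height_px / page_height_px

def pdf_to_page_pngs(pdf_path, target_height=2000):
    """Render each PDF page as PNG bytes at roughly target_height pixels tall."""
    import fitz  # PyMuPDF
    pages = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            zoom = zoom_for_height(page.rect.height, target_height)
            pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom))
            pages.append(pix.tobytes("png"))
    return pages

def ocr_page(client, png_bytes, model="gpt-4o-mini"):
    """One vision request per page image, sent as a base64 data URL."""
    b64 = base64.b64encode(png_bytes).decode()
    resp = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Transcribe all text on this page and describe any illustrations."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content
```

Looping over `pdf_to_page_pngs(...)` and appending each `ocr_page(...)` result to one file gives the single text output described above.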
So, a somewhat related question: how well does vision handle handwritten text?
It depends.
- Grocery shopping lists: pretty well
- Handwritten notes on invoices: usually well
- Doctor scribbles: pretty bad
Which model did you use to reach $1 for 50 pages?
I’ve been trying and testing lower resolutions with the GPT-4o-mini model, and the output is pretty accurate at a very low price.
At low resolution, the input (the image of the page) was around 7,000–9,000 tokens, and the output much lower than the input; let’s say an average of ~1,000 output tokens per page, depending on how much detail you want.
Higher resolutions seem to be in the range of 20,000 to 35,000 input tokens per request.
With GPT-4o-mini:
- Lower-resolution average price per page: < $0.01
- Higher-resolution average price per page: $0.01
- Lower-resolution average price per 50 pages: $0.09
- Higher-resolution average price per 50 pages: $0.25
Am I missing something?
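The arithmetic behind those per-page figures, assuming the published gpt-4o-mini rates of $0.15 per 1M input tokens and $0.60 per 1M output tokens (check the live pricing page; these rates are my assumption), with the token counts from the post above:

```python
# Assumed gpt-4o-mini rates, USD per token.
MINI_INPUT_PER_TOKEN = 0.15 / 1_000_000
MINI_OUTPUT_PER_TOKEN = 0.60 / 1_000_000

def page_cost(input_tokens, output_tokens):
    """Cost of one page-sized vision request at the assumed rates."""
    return (input_tokens * MINI_INPUT_PER_TOKEN
            + output_tokens * MINI_OUTPUT_PER_TOKEN)

low_res = page_cost(8_000, 1_000)    # ~8k input tokens at low resolution
high_res = page_cost(27_500, 1_000)  # midpoint of the 20k-35k high-res range
```

At these rates, 50 low-resolution pages come out around $0.09 and 50 high-resolution pages around $0.24, in line with the figures above.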
First, why use GPT-4o-mini? They priced it so it is actually more expensive for images than the latest GPT-4o.
Then, read the docs about internal image resizing.
- Low gets you a maximum dimension of 512px
- High gets you a maximum for the smaller side of 768px
So if you want high quality from a PDF sent the documented way, you would either send wide slices of 1536×512 (paying for three tiles of high intelligibility) and continue in later requests with vertical overlaps and a “continue from” prompt, or do your own custom slicing with overlaps at “low” detail, targeting 512px wide.
Something sent at 3000×4000 is seen at 768×1024, or maybe just a bit more if the ratio isn’t exact, which bumps up the tile count.
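A sketch of the “high” detail token count as I read the docs (scale to fit 2048×2048, then shortest side to 768px, then 512px tiles at 170 tokens each plus an 85-token base); treat the constants as my reading, not gospel:

```python
import math

def high_detail_image_tokens(width, height):
    """Billed image tokens at 'high' detail per the documented resizing rules."""
    # 1. Scale down to fit within a 2048x2048 square.
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # 2. Scale down so the shortest side is at most 768px.
    scale = min(1.0, 768 / min(w, h))
    w, h = w * scale, h * scale
    # 3. Count 512px tiles: 170 tokens each, plus a fixed 85-token base.
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles
```

For example, a 3000×4000 page lands at 768×1024: 4 tiles, so 765 billed image tokens (prompt text is extra).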
Are you sure about this? Even though GPT-4o-mini uses many more input tokens than GPT-4o for images, the final cost per 100 requests seems to be lower with GPT-4o-mini.
I uploaded a 1653 × 2339 image to both of the models. The results:
- GPT-4o: 1,113 input tokens, 725 output tokens.
- GPT-4o-mini: 36.8k input tokens, 742 output tokens.
Based on these values, GPT-4o would cost $2.06 per 100 requests, compared to $0.60 for GPT-4o-mini.
Are there other parameters they use to price the vision?
OpenAI massively cranks up the billed tokens artificially to ensure there is no bargain vision.
gpt-4o: 774 input tokens
gpt-4o-mini: 25510 input tokens
However, this also includes non-vision text in your case. Put the actual fixed cost of 85 tokens for “low”, plus 170 tokens per tile, through the per-token price.
The amplification can be discovered on the API pricing page’s image calculator; you are forced to make the discovery yourself. It produces the same price as gpt-4o-2024-05-13 and double the price of the cheaper gpt-4o-2024-08-06.
It’s also tricky that they force you to use a different calculator for mini instead of letting you quickly switch:
Mini actually provides the more satisfying answer, instead of “look it up yourself, pal”, though you should have less confidence in the smaller AI model actually knowing.
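Putting numbers to the parity claim. The prices and mini’s per-tile counts are my assumptions from the pricing page: gpt-4o-2024-05-13 at $5.00/1M input, gpt-4o-2024-08-06 at $2.50/1M, gpt-4o-mini at $0.15/1M, with mini billing roughly 33.3× the image tokens (2833 base + 5667 per tile):

```python
# Assumed per-token input prices (USD); check the live pricing page.
GPT4O_0513 = 5.00 / 1e6
GPT4O_0806 = 2.50 / 1e6
MINI = 0.15 / 1e6

# Image tokens for a 4-tile "high" detail image.
gpt4o_tokens = 85 + 4 * 170    # 765
mini_tokens = 2833 + 4 * 5667  # 25501 -- the ~33.3x amplification

gpt4o_cost = gpt4o_tokens * GPT4O_0513
mini_cost = mini_tokens * MINI
```

Both land at about $0.0038 per image, so mini’s image price tracks gpt-4o-2024-05-13 almost exactly and is roughly double gpt-4o-2024-08-06; the 25,501 figure also lines up with the 25,510 input tokens reported above once prompt text is included.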
Oh, I see now! I did the calculation with the API and now everything makes more sense.
Thanks for the explanation!
Hi RiavvioAs, could you send me the part of the code that checks the image resolution and loops over each image? If that’s OK, please post it here.