OpenAI API to read commercial invoices

I converted a scanned PDF to images so that gpt-4o can read it. I want it to read the products from a table inside the picture, but I keep getting weird results, no matter what instructions I give. I don't want to extract the text first, since the point is that the model can read images. Any tips on fixing this?


Hi @marieclaire.degroot and welcome to the community!

So PDF files by default are treated as text-only - there is a PDF parser employed under the hood, which is OK for paragraphs, but for tables the structure looks wonky and a lot of the context for the various columns and rows is lost.

My recommendation in this case would be to convert the PDF pages to images, using a library such as this, then encode the pages/images as base64 and send them to the Vision API as detailed here.
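Since you are working in C#, here is a rough sketch of the encoding step. The PDF-to-PNG rendering itself depends on whichever library you pick, so only the base64/data-URL part is shown; the file name is just a placeholder:

```csharp
// Minimal sketch: encode an already-rendered page image as a base64 data URL
// that the Vision API accepts in an image_url content part.
using System;
using System.IO;

static string ImageToDataUrl(string imagePath)
{
    byte[] pngBytes = File.ReadAllBytes(imagePath);   // one rendered PDF page as PNG
    string base64 = Convert.ToBase64String(pngBytes); // base64-encode the PNG bytes
    return $"data:image/png;base64,{base64}";         // data URL for the image_url field
}
```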

When sending to the Vision API, you would include a system prompt that specifies how to extract and represent the data in tables. I would recommend specifying either Markdown output or some other clean structured format (like YAML).

For example, if you would like tables to be represented as YAML, you would have something like this in the prompt:

**Table Formatting Instructions**
Format tables in YAML as per the following structure:
* Represent tables as an inline YAML code block with root node `table:`
* Include `description`, `column_names`, `row_names` and `data`
* Format each row as `row_name: {col1: value, col2: value, ...}`
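Putting it together, the request could look roughly like this in C# with plain HttpClient, reusing the `ImageToDataUrl` helper from the snippet above. The model name, prompt wording and file name are placeholders; it assumes your API key is in the `OPENAI_API_KEY` environment variable:

```csharp
// Sketch of a Chat Completions request that pairs the table-formatting
// system prompt with one page image.
using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Text.Json;

var payload = new
{
    model = "gpt-4o",
    messages = new object[]
    {
        new { role = "system", content = "You read commercial invoices. Format tables in YAML as per the structure: ..." },
        new
        {
            role = "user",
            content = new object[]
            {
                new { type = "text", text = "Extract the product table from this invoice page." },
                new { type = "image_url", image_url = new { url = ImageToDataUrl("page1.png"), detail = "high" } }
            }
        }
    }
};

using var http = new HttpClient();
http.DefaultRequestHeaders.Authorization =
    new AuthenticationHeaderValue("Bearer", Environment.GetEnvironmentVariable("OPENAI_API_KEY"));

var response = await http.PostAsync(
    "https://api.openai.com/v1/chat/completions",
    new StringContent(JsonSerializer.Serialize(payload), Encoding.UTF8, "application/json"));

Console.WriteLine(await response.Content.ReadAsStringAsync());
```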

From here you have a couple of options. You can also include in your prompt exactly what you want to extract, so the model will (once it has "in its mind's eye" done this table representation) do the extraction. Alternatively, if this doesn't perform well, you can take the YAML table representation as output and make a second call to the Chat Completions API to extract the information, supplying the YAML in your user prompt. In the latter case, it may even be possible to use a smaller (mini) model.
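A rough sketch of that second, text-only call (model name and prompt wording are placeholders; `yamlTable` is whatever the first call returned):

```csharp
// Optional second pass: hand the YAML table back as plain text and let a
// smaller model pull out just the fields you need.
string yamlTable = "..."; // YAML produced by the Vision call above

var extractionPayload = new
{
    model = "gpt-4o-mini",
    messages = new object[]
    {
        new { role = "system", content = "You extract invoice line items from YAML tables provided by the user." },
        new { role = "user", content = $"List every product name, quantity and unit price from this table:\n{yamlTable}" }
    }
};
// POST extractionPayload to https://api.openai.com/v1/chat/completions exactly as in the previous snippet.
```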

Hope this helps!

I am making this in C#.

I already converted the PDF to images using Tesseract. If I use this image in ChatGPT, I get a completely correct result, with the products extracted from the table. However, when I use the API or try the assistant in the Playground, the result doesn't come anywhere near right: it makes up the prices and doesn't read all the products.