Best approach for extracting data from diverse invoice PDFs using OpenAI - Seeking guidance on model selection and training strategy

davy.martial3 · November 4, 2024, 9:31am

Hello OpenAI community,

I’m developing a solution to extract structured data from invoice PDFs. Here’s my specific scenario:

Current Setup:

Need to process various invoice formats (different layouts and page lengths)
Currently using standard API calls with detailed prompts (~15,000 tokens)
Getting inconsistent results when running identical prompts multiple times

Key Questions:

Which OpenAI approach would be most suitable:

Standard API with optimized prompts
Fine-tuning
Embeddings

I currently have a limited dataset of invoices. Would it be viable to:

Start with a free-tier service to collect more training data from users
Use this collected data to fine-tune the model later
Gradually improve accuracy as the dataset grows

Output:

I want a structured json output

Has anyone implemented something similar or can suggest the most effective approach? I’m particularly interested in strategies for building up a training dataset while providing initial value to users.

Thanks in advance for your guidance!

j.wischnat · November 4, 2024, 11:46am

Have you tried tweaking the hyperparameters?
The easiest to change would be the temperature. Try the lowest setting closest to 0 so you get more accurate responses.

Alternatively you might need to fine tune or use embeddings.

On some LLMs you can define what the AI’s response should start with which would give you a more controlled answer.

This might also be of use.

Good luck!

platypus · November 4, 2024, 11:59am

Hi @davy.martial3 and welcome to the community!

How I would do this:

Convert each page in the PDF into an “image” using PyMuPDF
Send the images with the system prompt to GPT-4 Vision API. Go easy with the prompt - don’t make it too elaborate, and enforce output in a semi-structured format, such as “key: value” pair format, YAML, or similar
Take the response and send with your JSON schema to GPT-4o with structured outputs enabled

I would start with this first, before going down the fine-tuning road. For fine-tuning you want to have as much data that is specific to your problem as possible. In other words - as many invoices in the specific formats that you expect.

davy.martial3 · November 4, 2024, 1:07pm

Thank I actually just did, I put 0 to avoid any “freestyle” are there any other hyperparameters that you would consider?

davy.martial3 · November 4, 2024, 1:08pm

Thank you, do you think the approach of working with prompt first, giving access to beta users, and then use their invoices to train my fine tuning model later on could work?

j.wischnat · November 4, 2024, 1:10pm

This depends on the endpoint you are using;

This would be for the chat endpoint. There you can see most parameters.

platypus · November 4, 2024, 1:30pm

Sure! Starting with the prompt first is the way to go, and you can safely assume that the models will get even better, so you may not need finetuning at all, but you always have that in the back pocket as you accumulate more data.

The approach I described is what I successfully used in parsing very complex visually rich investment based documents, containing tables and charts. For invoices, which typically have a more consistent structure, it should work even better.

Topic		Replies	Views
Trainining based on complex text API gpt-4 , chatgpt , api	8	1646	July 5, 2023
How to Process PDF Files with OpenAI's Tools and APIs for Invoice Automation? API api , gpt-4-vision , ocr	1	810	January 15, 2025
Best OpenAI plan for document analysis with OCR and Power Automate? Community api	3	665	March 17, 2025
How to Extract Data from Images Using OpenAI API? API gpt-4	1	1878	October 18, 2024
Text parsing and producing the stable JSON output Prompting gpt-4 , gpt-35-turbo , api , json	2	2761	July 4, 2024

Best approach for extracting data from diverse invoice PDFs using OpenAI - Seeking guidance on model selection and training strategy

Related topics