Best approach for extracting data from diverse invoice PDFs using OpenAI - Seeking guidance on model selection and training strategy

Hello OpenAI community,

I’m developing a solution to extract structured data from invoice PDFs. Here’s my specific scenario:

Current Setup:

  • Need to process various invoice formats (different layouts and page lengths)
  • Currently using standard API calls with detailed prompts (~15,000 tokens)
  • Getting inconsistent results when running identical prompts multiple times

Key Questions:

  1. Which OpenAI approach would be most suitable:
  • Standard API with optimized prompts
  • Fine-tuning
  • Embeddings
  1. I currently have a limited dataset of invoices. Would it be viable to:
  • Start with a free-tier service to collect more training data from users
  • Use this collected data to fine-tune the model later
  • Gradually improve accuracy as the dataset grows

Output:

I want a structured json output

Has anyone implemented something similar or can suggest the most effective approach? I’m particularly interested in strategies for building up a training dataset while providing initial value to users.

Thanks in advance for your guidance!

2 Likes

Have you tried tweaking the hyperparameters?
The easiest to change would be the temperature. Try the lowest setting closest to 0 so you get more accurate responses.

Alternatively you might need to fine tune or use embeddings.

On some LLMs you can define what the AI’s response should start with which would give you a more controlled answer.

This might also be of use.

Good luck! :hugs:

2 Likes

Hi @davy.martial3 and welcome to the community!

How I would do this:

  1. Convert each page in the PDF into an “image” using PyMuPDF
  2. Send the images with the system prompt to GPT-4 Vision API. Go easy with the prompt - don’t make it too elaborate, and enforce output in a semi-structured format, such as “key: value” pair format, YAML, or similar
  3. Take the response and send with your JSON schema to GPT-4o with structured outputs enabled

I would start with this first, before going down the fine-tuning road. For fine-tuning you want to have as much data that is specific to your problem as possible. In other words - as many invoices in the specific formats that you expect.

3 Likes

Thank I actually just did, I put 0 to avoid any “freestyle” are there any other hyperparameters that you would consider?

Thank you, do you think the approach of working with prompt first, giving access to beta users, and then use their invoices to train my fine tuning model later on could work?

This depends on the endpoint you are using;

This would be for the chat endpoint. There you can see most parameters. :smile:

Sure! Starting with the prompt first is the way to go, and you can safely assume that the models will get even better, so you may not need finetuning at all, but you always have that in the back pocket as you accumulate more data.

The approach I described is what I successfully used in parsing very complex visually rich investment based documents, containing tables and charts. For invoices, which typically have a more consistent structure, it should work even better.

1 Like