Using GPT-4-Turbo to fill out complex PDF forms

Without going into too much detail: my company is an intermediary between other companies (clients) and end users (customers). We receive many different PDF forms from these clients, which must be filled in with each customer’s data.

These forms can change or be replaced quite regularly and can have many pages’ worth of fields to be filled in, so hardcoding the logic to fill them in requires a lot of rework each time one of the forms changes.

What I am trying to do is to use OpenAI’s API to have GPT-4-Turbo fill out the PDF forms for me. In summary, I provide the model with the PDF form as well as all of the customer’s data we have in JSON, and then ask the model to put the correct data point in each field.

In reality, it is not as simple as this, mainly because although the model can view and read the PDFs, it cannot fill them in. So, my current approach is the following:

Tagging the PDF:
Luckily, each PDF has embedded fields/widgets (i.e. input text boxes) which can be read and edited programmatically. I iterate over all of these widgets and write a red integer ID in each, so that it looks like this:
[screenshot: a form page with a red integer ID stamped on each field]
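
A minimal sketch of this step, assuming PyMuPDF (fitz) as the PDF library (the helper name, offset, and font size are just illustrative choices):

import fitz  # PyMuPDF

def tag_pdf(in_path: str, out_path: str) -> dict[int, str]:
    # Stamp a red integer ID on every form widget and return the
    # ID -> field-name mapping needed later for the fill step.
    doc = fitz.open(in_path)
    id_to_field = {}
    next_id = 1
    for page in doc:
        for widget in page.widgets():
            page.insert_text(
                widget.rect.tl + (2, 10),  # nudge the ID inside the box
                str(next_id),
                color=(1, 0, 0),  # red
                fontsize=8,
            )
            id_to_field[next_id] = widget.field_name
            next_id += 1
    doc.save(out_path)
    return id_to_field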

Mapping the user data to each field:
Each page can have 50+ fields, and each PDF can have 10+ pages. I found that passing all of this in a single request overwhelmed and confused the model, so instead I send the pages one by one, and the results are much better.

I also found that using the Assistants API with the PDF uploaded as a file for retrieval led to very poor results: the model could not seem to pair the red IDs with the field names, so it was filling in the form with the correct data but assigning each value to the wrong integer ID. I assume retrieval extracts the PDF as raw text for embedding, so the spatial pairing of field names with IDs, and the colour of the IDs, is lost.

Instead, converting each page of the PDF into a high-quality PNG image and adding it as image content in the message solved this issue.
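
Again as a sketch, assuming PyMuPDF, the page-to-PNG conversion can look like:

import base64
import fitz  # PyMuPDF

def render_pages_as_png(pdf_path: str, dpi: int = 300) -> list[str]:
    # Render each page to a high-resolution PNG and return the pages
    # as base64 strings, ready to embed as image content in a request.
    doc = fitz.open(pdf_path)
    return [
        base64.b64encode(page.get_pixmap(dpi=dpi).tobytes("png")).decode("ascii")
        for page in doc
    ]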

So, I now make a per-page request to the Chat Completions API with GPT-4-Turbo (a sketch of the full request follows this list) and provide:

  • a PNG image of a page of the PDF, tagged with a red integer ID on each field
  • a JSON object of all of the customer’s data e.g.:
{
    "passport_number": "X123456789",
    ...
}
  • Instructions on what the task is, including an instruction to respond in JSON (with response_format={"type": "json_object"}) in the format:
{
    "<FIELD_ID>": "<VALUE>",
    ...
}

Filling the PDF:
I then parse the JSON response and programmatically fill in each widget on the PDF corresponding to its <FIELD_ID>.
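
A sketch of that fill step, again assuming PyMuPDF and the ID-to-field-name mapping saved during tagging:

import fitz  # PyMuPDF

def fill_pdf(pdf_path: str, out_path: str,
             id_to_field: dict[int, str], answers: dict[str, str]) -> None:
    # Invert the model's answer: field name -> value assigned to its red ID.
    name_to_value = {
        id_to_field[int(fid)]: value
        for fid, value in answers.items()
        if int(fid) in id_to_field
    }
    doc = fitz.open(pdf_path)
    for page in doc:
        for widget in page.widgets():
            if widget.field_name in name_to_value:
                widget.field_value = name_to_value[widget.field_name]
                widget.update()  # commit the new value to the widget
    doc.save(out_path)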

Results are very good for a single page!

Problem:
The main problem I am having now is that the model has no context about what it was doing on the previous page, so if a section spills over onto the next page without repeating the section heading or any other context, the results are incorrect.

An example of this is a form which requires the personal details of a husband and wife; I provide the model with both the husband’s and the wife’s data. Page 1 has a section titled “Husband’s Personal Details”, and then, three-quarters of the way down the page, the next section, titled “Wife’s Personal Details”, begins and continues onto page 2. However, the top of page 2 has no title or other indication that it is a continuation of the wife’s details, so the model starts filling in the husband’s data again.

I am not sure how to fix this. I am already sending each page as a new message in the same conversation, so each request includes the messages with the previous pages’ images too.
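
For reference, that history handling looks roughly like this sketch (reusing the client, render_pages_as_png, and other placeholders from the earlier sketches):

import json
from openai import OpenAI

client = OpenAI()
TASK_INSTRUCTIONS = "..."  # the full task prompt, abbreviated here
page_images = render_pages_as_png("tagged_form.pdf")  # from the render sketch above

# One growing message list: each page is appended as a new user message
# and each model reply is appended back as an assistant message.
messages = [{"role": "system", "content": TASK_INSTRUCTIONS}]
results = []
for page_b64 in page_images:
    messages.append({
        "role": "user",
        "content": [
            {"type": "text", "text": "Map the red field IDs on this page."},
            {"type": "image_url",
             "image_url": {"url": "data:image/png;base64," + page_b64}},
        ],
    })
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        response_format={"type": "json_object"},
        messages=messages,
    )
    reply = response.choices[0].message.content
    results.append(json.loads(reply))
    # Feed the model's own answer back so the next page has it as context.
    messages.append({"role": "assistant", "content": reply})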

Questions:

  1. Does my overall approach seem decent? Having to tag the PDF with these red IDs and then convert it to an image doesn’t seem like the best approach.
  2. Is using GPT-4-Turbo (i.e. Vision) the correct choice over the Assistants API?
  3. Any suggestions on how to solve this main issue of sections being split across pages?