Trainining based on complex text

I have an interesting challenge that involves processing semi-structured PDF files using a data entry team. Their task is to extract data from these files, which consists of key-value pairs.I mean the key:paris are not directly mentioned. It would be like : "XYZ fullterm cost is “$400”. At times it might be a bit complex. I would like to incorporate OpenAI’s technology into this process. The keys remain constant, but the corresponding answers vary depending on the customer data. This is where I believe AI/ML can be applied effectively.

Here’s what I have attempted so far: I converted the PDF files to text format and fed the entire text data to ChatGPT. Then, I provided a prompt stating that for a given set of data, the key is “xyz” and the answer is “$400”. ChatGPT generated a response with some context. Afterwards, I informed ChatGPT that I would be providing more sample data and asked it to give me the answer to the “xyz” key based on the new sample data, which it correctly provided.

Now, I would like guidance on how to proceed with this project. I have tried fine-tuning the model, but I’m unsure about the approach I took, which involved creating a JSONL format with all the key-label pairs derived from the PDF files. I don’t believe this is the right approach, and I’m seeking advice on the correct way to tackle this.

One possible method is many shot prompting, you show it examples of the “correct” way to do the task, give it 100, 200, as many as is needed examples and then run that large prompt against your data.

One of the fundamental powers of AI is it’s ability to turn unstructured data into structured data, but that ability, as it is in humans, is never 100%, you can get 99.5%(ish)

just as a primer your prompt would look something like

"You are a key-pair finding expert AI, you will find the corrisponding key pair half when prompted with the other half, here are some examples in ### makers of that being done correctly, please use them as a guide
###
prompt : my roof was $400 | correct response : roof - 400$ (just an example)

###

Now this is the actual data I wish you to work on, please provide your best response to
{user_query}
"

1 Like

Piggybacking off of @Foxalabs , one-shot, few-shot, and many-shot prompts are extremely powerful and can be used to great effect.

But, if you wanted to continue with fine-tuning, I would ask how many examples did you have in your training set?

Are the key:value pairs always contained within the same sentence?

Does every sentence have a key:value pair in it?

I imagine if the answers to my last two questions are “yes” and “no” respectively, you might have good luck with a multi-step approach.

  1. Parse the text from the PDF into individual sentences.
  2. Run a classifier to determine if there is a key:value pair present in each sentence.
  3. Run an extraction model to pull the key:value pair from each sentence which contains such a pair.

Doing it this way greatly simplifies the problem as each model is only responsible for one thing and one thing only.

It also allows you to better understand where and why failures occur to allow you to better fine tune each model independently.

Beyond that, make sure you have enough examples in your training set to properly train each model.

If you don’t already have enough examples, you might consider generating synthetic data using existing data as few-shot or many-shot examples.

1 Like

The dataset I’m working with pertains to insurance and is typically in the form of a PDF file spanning around 5 to 7 pages. Within this file, I am specifically interested in a select number of items, usually ranging from 20 to 30, which can be represented as key-value pairs. For example, the file might mention that the insurance premium amounts to $400, where “premium” serves as the key and “$400” as the corresponding value.

Based on this setup, I compiled a training dataset consisting of approximately 400 prompts. Each prompt includes a specific key, such as “deductible,” along with its corresponding value, like “$1,200.”

After fine-tuning the model, I attempted to test it using a sample text that contained information about the premium. In the playground, I queried the model to provide me with the premium amount, but instead of the desired response, I received an unrelated explanation of what a premium is.

One thing that’s jumping out at me is that—based on this description—your usage prompts aren’t the same as your training prompts.

Interesting. Can you give some example of how it should be?

Initially it seemed as though you wanted to take a PDF and extract a data set of key:value pairs which you could then use in a different context.

Now it appears you’re wanting to be able to ask questions of a PDF—something for which there are about 50 plugins to accomplish.

So, I guess to give any further advice I’d need to better know what exactly you are trying to accomplish and what the purpose of the key:value pairs is, in your mind.

1 Like

I have a web application that requires input of insurance-related details, which are typically found in PDF files. So a dataentry person goes throght the pdf file and then based on his judgfement, fills fills the values against the keys.

I am seeking a workflow where I can upload the PDF file to OpenAI’s API, and in return, retrieve the corresponding key values. These retrieved key-value pairs can then be seamlessly integrated into my web application.

Then I would refer you back to my first post in this topic. I think that method would have the best chance of success.