Tabular data for finetuning a model

I have a PDF document about insurance that is primarily composed of tabular data. For instance, page 3 has a four-column layout, while page 5 has three columns. My objective is to automate the extraction of answers based on specific key-value pairs. As an illustration, the first row of the first column might say "Cost of drugs," with the corresponding answer located in the first row of the second column. The answer is a sentence that contains the cost, so it is not a direct key-value pair.

For training purposes, I have generated prompts and converted the PDF into plain text. However, this conversion loses the table's row-column structure. As a result, my fine-tuned model starts incorrectly retrieving answers from the wrong column, e.g. the third column in the example above. How do you handle fine-tuning on tabular data?
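One workaround, since flat text loses the column layout, is to serialize each extracted row into an explicit key-value record before building training prompts, so the model never sees ambiguous column positions. Below is a minimal sketch under assumed data: the `rows` list and the regex for pulling a dollar amount out of the answer sentence are hypothetical, not from the original post.

```python
import json
import re

# Hypothetical rows as they might come out of a table-extraction tool:
# each row is a list of cell strings, with column order preserved.
rows = [
    ["Cost of drugs", "The annual cost of covered drugs is $1,200.", "N/A", "See note 4"],
    ["Deductible", "You pay the first $500 each year.", "N/A", "See note 2"],
]

def row_to_record(row):
    """Pair the key in column 1 with the answer sentence in column 2,
    and pull the dollar amount out of the sentence if one is present."""
    key, answer = row[0].strip(), row[1].strip()
    match = re.search(r"\$[\d,]+(?:\.\d{2})?", answer)
    return {
        "key": key,
        "answer": answer,
        "value": match.group(0) if match else None,
    }

records = [row_to_record(r) for r in rows]
print(json.dumps(records, indent=2))
```

Training prompts built from these records ("Q: What is the cost of drugs? A: $1,200") keep the key and its value tied together regardless of how the PDF laid out its columns.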


Odds are the tables in the PDF are not stored as structured data but as drawing instructions in the PDF's page-description language (a descendant of PostScript), which makes them very hard to extract. Also, many applications that extract text from PDFs are not smart enough to recover tabular structure, but some are.

If this were my problem, I would run several of the PDFs through a conversion application and then hand-check whether the table information comes out in a format that actually looks like a table, preferably an Excel table, JSON, or CSV if the application supports that.
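As a concrete sketch of that hand-check step: several extraction libraries (for example pdfplumber's `page.extract_tables()`, Camelot, or Tabula) return each table as a list of rows of cell strings. Once you have that, dumping it to CSV for eyeballing is trivial. The `extracted_table` data below is made up for illustration; only the CSV serialization is real.

```python
import csv
import io

# Hypothetical output from a PDF table extractor (e.g. pdfplumber's
# page.extract_tables() returns a list of tables, each being a list
# of rows, each row a list of cell strings).
extracted_table = [
    ["Benefit", "What you pay", "Limit", "Notes"],
    ["Cost of drugs", "$1,200 per year", "None", "See note 4"],
]

def table_to_csv(table):
    """Serialize one extracted table to CSV text for hand-checking."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerows(table)
    return buf.getvalue()

print(table_to_csv(extracted_table))
```

If the CSV rows line up with what you see on the page, the extractor preserved the structure; if cells are merged or shifted, you know before you spend time fine-tuning.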

You may even have to purchase an application that can extract the tables.