Tabular data for finetuning a model

I possess a PDF document concerning insurance, primarily composed of tabular data. For instance, page 3 features a four-column layout, while page 5 presents three columns. My objective is to automate the extraction of answers based on specific key-value pairs. As an illustration, the first row of the first column might display "Cost of drugs," with the corresponding answer located in the first row of the second column. The value cell is a sentence that contains the cost, so it isn't a direct key-value pair.

For training purposes, I have generated prompts and converted the PDF to text. However, this conversion loses the table's row-column structure. Consequently, my fine-tuned model begins to incorrectly retrieve answers from the third column, as in the example above. How do you handle fine-tuning on tabular data?

Thanks

Odds are the tables in the PDF are not actually structured tables of data but text positioned by drawing commands in the PDF's PostScript-derived page description language, which makes them very hard to extract. Also, many applications that extract text from PDFs are not smart enough to recover tabular data, though some are.
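To see the problem concretely, here is a minimal sketch using the pypdf library, which returns text in drawing order with no notion of columns; the file name is a placeholder, not from your post:

```python
# Minimal sketch: plain text extraction flattens the table layout.
# "policy.pdf" is a placeholder file name.
from pypdf import PdfReader

reader = PdfReader("policy.pdf")
page = reader.pages[2]  # page 3 in your example (pages are 0-indexed)

# extract_text() returns glyphs in drawing order; cells from different
# columns often run together on one line, so a label like "Cost of drugs"
# loses its pairing with the value in the next column.
print(page.extract_text())
```

This is exactly why the fine-tuned model drifts to the wrong column: the training text never contained the column boundaries in the first place.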

If this were my problem, I would run several of the PDFs through a conversion application and then hand-check whether the table information comes out in a format that looks like a table, preferably an Excel table, JSON, or CSV if the application can produce that.
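As a sketch of that workflow, a table-aware library such as pdfplumber can often recover the cell grid directly, which you can dump to CSV for hand-checking. The file and output names below are placeholders:

```python
# Sketch: extract tables with pdfplumber and write each one to CSV
# so the row/column structure can be checked by hand.
import csv
import pdfplumber

with pdfplumber.open("policy.pdf") as pdf:  # placeholder file name
    for page_num, page in enumerate(pdf.pages, start=1):
        # extract_tables() returns a list of tables; each table is a
        # list of rows, and each row is a list of cell strings (or None).
        for table_num, table in enumerate(page.extract_tables(), start=1):
            out_name = f"page{page_num}_table{table_num}.csv"
            with open(out_name, "w", newline="") as f:
                writer = csv.writer(f)
                for row in table:
                    writer.writerow([cell or "" for cell in row])
```

Spot-check the resulting CSVs against the PDF before building any training data from them; extraction quality varies a lot between documents.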

You may even have to purchase an application that can extract the tables.
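Once the tables are clean, one way to build the fine-tuning data is to pair each label cell with its value cell explicitly. This is only a sketch: it assumes column 0 holds the label and column 1 the value, the example row is made up, and the prompt/completion field names are a common convention you should adapt to your training format:

```python
# Sketch: turn a cleaned table into prompt/completion pairs (JSONL).
# Assumes column 0 = label (e.g. "Cost of drugs"), column 1 = value;
# adjust the indices to match your actual tables.
import json

rows = [
    ["Cost of drugs", "$25 copay per prescription"],  # made-up example row
]

with open("train.jsonl", "w") as f:
    for row in rows:
        label, value = row[0], row[1]
        example = {
            "prompt": f"What is the {label.lower()}?",
            "completion": value,
        }
        f.write(json.dumps(example) + "\n")
```

Because each training example now carries the label and value as an explicit pair, the model no longer has to infer column boundaries from flattened text.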