I have an insurance PDF that consists mostly of tabular data. For instance, page 3 has a four-column layout, while page 5 has three columns. My objective is to automate the extraction of answers from key-value pairs in these tables. As an illustration, the first row of the first column might read “Cost of drugs,” with the corresponding answer located in the first row of the second column. The key reads like a sentence and the cost sits next to it, so it is not a direct key-value pair.
For training purposes, I have generated prompts and converted the PDF to text. However, this conversion loses the table’s column-row structure. Consequently, my fine-tuned model starts retrieving answers from the wrong column; in the example above, it picks the third column instead of the second. How do you handle fine-tuning on tabular data like this?
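To make the question concrete, here is a minimal sketch of the kind of structure-preserving conversion I have in mind, in case it helps frame answers. It assumes the table rows have already been recovered as lists of cells by some table-aware extractor (for example, something like pdfplumber's `page.extract_tables()`), which I have not verified on my document. The function names, the header labels, and the placeholder values (`VALUE_A`, `NOTE_A`, `NOTE_B`) are all hypothetical; only “Cost of drugs” comes from my actual PDF.

```python
from typing import List, Optional

Row = List[Optional[str]]


def clean(cell: Optional[str]) -> str:
    """Normalize one cell: None becomes '', inner whitespace/newlines collapse."""
    return " ".join((cell or "").split()).replace("|", "\\|")


def rows_to_markdown(rows: List[Row]) -> str:
    """Serialize rows (first row treated as the header) as a Markdown table,
    so column boundaries survive the conversion to plain text."""
    width = max(len(r) for r in rows)
    # Pad short rows so every line has the same number of columns.
    padded = [[clean(c) for c in r] + [""] * (width - len(r)) for r in rows]
    header, body = padded[0], padded[1:]
    lines = ["| " + " | ".join(header) + " |",
             "| " + " | ".join(["---"] * width) + " |"]
    lines += ["| " + " | ".join(r) + " |" for r in body]
    return "\n".join(lines)


def rows_to_key_value(rows: List[Row], key_col: int = 0,
                      value_col: int = 1) -> List[str]:
    """Alternative: emit one explicit 'key -> column name: value' line per row,
    naming the column the value came from."""
    header, body = rows[0], rows[1:]
    out = []
    for r in body:
        if max(key_col, value_col) < len(r):
            out.append(f"{clean(r[key_col])} -> "
                       f"{clean(header[value_col])}: {clean(r[value_col])}")
    return out


# Hypothetical rows; the values are placeholders, not real data.
rows = [
    ["Item", "Plan cost", "Notes"],
    ["Cost of drugs", "VALUE_A", "NOTE_A"],
    ["Hospital\nstay", None, "NOTE_B"],
]

print(rows_to_markdown(rows))
print(rows_to_key_value(rows))
```

The idea would be to build the training prompts from text in one of these forms instead of the flattened text, so the model can see which column each value belongs to. I do not know whether this is enough on its own, which is part of what I am asking.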
Thanks