I have a lot of PDFs for my company, and many of them contain tables. I'm confused about the most effective way to give tabular data as input to an LLM. Since I'm a newbie at this, any help would be much appreciated.
What you need to do is convert the PDF data to a plain-text (ASCII) format.
There are easy tools for that, like pdftotext, but they won't preserve tabular data.
You could use pdfminer or poppler-utils as well, but that's not going to give you 100% accuracy.
So I suggest you first render the PDF to an image with Ghostscript.
Then use Tesseract to do some OCR magic.
Then transform that output to hOCR, which you could give to the LLM in a prompt or create embeddings from.
You could even create HTML from that and make a PDF again.
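The Ghostscript → Tesseract → hOCR steps above can be sketched as a small Python wrapper. This is just a sketch for a single-page PDF: the flags, resolution, and file names here are my assumptions, so adjust them for your documents.

```python
import subprocess

def build_commands(pdf_path, image_path="page.png", out_base="page"):
    """Build the Ghostscript and Tesseract commands for one page.
    hOCR is an HTML-based OCR output format that keeps word
    coordinates, which is what lets you recover table layout later."""
    gs_cmd = [
        "gs", "-dNOPAUSE", "-dBATCH",     # run non-interactively
        "-sDEVICE=png16m", "-r300",        # render a 300 dpi PNG
        "-dFirstPage=1", "-dLastPage=1",   # just the first page here
        f"-sOutputFile={image_path}", pdf_path,
    ]
    # passing "hocr" as the config makes Tesseract write <out_base>.hocr
    tess_cmd = ["tesseract", image_path, out_base, "hocr"]
    return gs_cmd, tess_cmd

def pdf_page_to_hocr(pdf_path):
    # requires gs and tesseract to be installed on the machine
    for cmd in build_commands(pdf_path):
        subprocess.run(cmd, check=True)
    return "page.hocr"
```

For multi-page PDFs you'd loop over pages (drop the `-dFirstPage`/`-dLastPage` pair and use a `%d` pattern in the output file name instead).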
Thanks. I created an image of the PDF and used OCR to extract the text. But I'm unsure how the LLM would know the correct mapping of keys to values in the table from the text extracted out of the image.
Suppose this is the table: what would I do with the extracted OCR text? Would a GPT model be able to give me the correct value for each question asked?
Yes, but it really depends on your ability to prompt.
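One way to take the guesswork out of the prompt is to reformat the OCR text into explicit key: value lines before it goes to the model. A minimal sketch; the two-column table, header names, and question below are made up for illustration:

```python
def table_to_prompt(header, rows, question):
    """Turn OCR'd table rows into explicit 'key: value' lines so the
    model doesn't have to guess which cell belongs to which column."""
    lines = []
    for row in rows:
        # pair each cell with its column header
        lines.append(", ".join(f"{h}: {v}" for h, v in zip(header, row)))
    table_text = "\n".join(lines)
    return f"Here is a table:\n{table_text}\n\nQuestion: {question}"

prompt = table_to_prompt(
    ["Item", "Price"],
    [["Widget", "$5"], ["Gadget", "$9"]],
    "What is the price of the Gadget?",
)
```

With the rows spelled out like this, the header-to-cell mapping travels with every row, so the model doesn't depend on column alignment surviving the OCR step.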
Oh okay, will try. I'd also like to ask another question: suppose I have a lot of points, sub-points, and sub-sub-points in my PDF file, which I want to feed to the embeddings and then to the GPT model. What do you suggest for getting effective output? I'd also like to know what sort of text cleaning we should generally do.
Sounds like a tree to me.
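If the numbering in the PDF is consistent (1., 1.1, 1.1.1, …), you can recover that tree before embedding, so each chunk keeps its parent headings for context. A rough sketch: the regex assumes dotted decimal numbering, which is an assumption about your documents.

```python
import re

def parse_points(lines):
    """Nest '1. / 1.1 / 1.1.1' style points into a tree of dicts.
    Depth is taken from the number of dotted components."""
    root = {"text": "ROOT", "children": []}
    stack = [(0, root)]  # (depth, node) path from root to current point
    pattern = re.compile(r"^(\d+(?:\.\d+)*)[.)]?\s+(.*)")
    for line in lines:
        m = pattern.match(line.strip())
        if not m:
            continue  # real cleaning would attach stray text to the last point
        depth = m.group(1).count(".") + 1   # "1.1.1" -> depth 3
        node = {"text": m.group(2), "children": []}
        while stack[-1][0] >= depth:        # pop back up to the parent level
            stack.pop()
        stack[-1][1]["children"].append(node)
        stack.append((depth, node))
    return root
```

You can then embed each leaf together with the text of its ancestors, so a sub-sub-point still carries the section it belongs to.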
Maybe you also want to check out this: