I have a lot of PDFs for my company, and they contain a lot of tables. I am confused about the most effective way to give tabular data as input to LLMs. Since I am a newbie at this, any help would be much appreciated.
What you need to do is convert the PDF data to a plain-text format.
There are some easy tools for that, like pdf2text, but they won’t preserve the table structure.
You could use pdfminer or poppler utils as well, but that’s not going to give you 100% accuracy.
So I suggest you first render the PDF pages to images with Ghostscript.
Then use Tesseract to do some OCR magic.
Have it produce hOCR output, which you could give to the LLM in a prompt or create embeddings from.
You could even create HTML from that and make a PDF.
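A minimal sketch of the Ghostscript-then-Tesseract steps above, assuming the `gs` and `tesseract` binaries are installed and on the PATH. The function names, the 300 dpi setting, and the `page-%03d.png` naming are my own choices for illustration, not something prescribed in this thread.

```python
import subprocess
from pathlib import Path


def ghostscript_cmd(pdf_path, out_dir, dpi=300):
    """Build a Ghostscript command that renders each PDF page to a PNG."""
    out_pattern = str(Path(out_dir) / "page-%03d.png")
    return [
        "gs", "-dNOPAUSE", "-dBATCH", "-dSAFER",
        "-sDEVICE=png16m",               # 24-bit colour PNG output
        f"-r{dpi}",                      # resolution; assumed value, tune for OCR quality
        f"-sOutputFile={out_pattern}",   # %03d is replaced by the page number
        str(pdf_path),
    ]


def tesseract_cmd(image_path, lang="eng"):
    """Build a Tesseract command that writes <image stem>.hocr next to the image."""
    image_path = Path(image_path)
    out_base = str(image_path.with_suffix(""))
    # The trailing "hocr" is the Tesseract config name that selects hOCR output.
    return ["tesseract", str(image_path), out_base, "-l", lang, "hocr"]


def pdf_to_hocr(pdf_path, out_dir):
    """Render the PDF to images, OCR each page, and return the hOCR file paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    subprocess.run(ghostscript_cmd(pdf_path, out_dir), check=True)
    hocr_files = []
    for png in sorted(out_dir.glob("page-*.png")):
        subprocess.run(tesseract_cmd(png), check=True)
        hocr_files.append(png.with_suffix(".hocr"))
    return hocr_files
```

Calling `pdf_to_hocr("report.pdf", "ocr_out")` (hypothetical file names) would leave one PNG and one `.hocr` file per page in `ocr_out`.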
Thanks. I created an image of the PDF and used OCR to extract the text. But I am unsure how the LLM would know which keys map to which values in the table, given only the text extracted from the image.
Suppose this is the table, what would I do with the text extracted by OCR? Would a GPT model be able to give me the correct corresponding values for each question asked?
Yes, but it really depends on your ability to prompt.
Oh okay, will try. I would also like to ask another question. Suppose I have a lot of points, sub-points and sub-sub-points in my PDF file, which I want to feed to the embedding model and subsequently to the GPT model. What would you suggest to get effective output? I also want to know what sort of text cleaning we should generally do.
Sounds like a tree to me.
Maybe you also want to check out this:
Hey @aditigajurel32, I’m also trying to do something similar. Were you able to achieve the expected result with all the steps mentioned in this thread?
Hey @aditigajurel32 @jochenschultz
I’m trying to build an LLM-based solution on top of a Word document that has tabular data with 5 columns (attributes), one of which is a regex. I understand that an LLM only parses sequential data. The task at hand is to send an error message to the model, and the model must be able to find a regex match and give me the category name. The challenge I’m facing is that I converted the document’s tabular data to a DataFrame first, and then converted the DataFrame to string values with tostring. Because of this, I guess the mapping between column names and row values has been lost. I’m a complete rookie at this, could anyone advise? Any input is much appreciated.
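One way to keep the column-to-value mapping when flattening a table to text is to write every row as labelled "column: value" lines instead of one big grid. The sketch below is only an illustration under that assumption: the column names and values are hypothetical, and the input is a plain list of dicts (with pandas, `df.to_dict(orient="records")` produces the same shape).

```python
def table_to_records(rows):
    """Turn each row (a dict of column -> value) into a labelled text block."""
    records = []
    for row in rows:
        # Every value carries its own column name, so the mapping survives
        # even when a single record is embedded or quoted in a prompt alone.
        lines = [f"{column}: {value}" for column, value in row.items()]
        records.append("\n".join(lines))
    return records


# Hypothetical rows for illustration only; not taken from the real document.
example_rows = [
    {"Pattern Name": r"Combobox not found for label .+",
     "Category Name": "UI element not found"},
    {"Pattern Name": r"Timeout after \d+ seconds",
     "Category Name": "Timeout"},
]

for record in table_to_records(example_rows):
    print(record)
    print("---")
```

Each record can then be embedded or pasted into a prompt on its own, which tends to be easier for a model to read than a wide, space-aligned grid.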
Sounds like you need a spatial solution. Tables are pretty easy to segment when you see rows and columns as areas with stuff inside.
Maybe you also want to check out this:
But to get deeper into it, you need to explain what you mean by “build a llm with word document”.
I was assuming you want to extract tabular data from a document. Is that correct?
And what do you mean by regular expression? You don’t want to say there are regular expressions in the fields, right? You meant that you are extracting values by matching against regex?
To clarify, attached is a screenshot of how my document looks.
My whole document is data laid out in table format. One of the columns is a regular expression, which I need to use to perform a Q+A retrieval task.
The harder I try to understand what you are trying to say the more confused I become.
What do you need to do?
- extract data from that document?
- isn’t it a document that describes what you should do?
There must be more. Like some documents, data sources, etc. that you want to extract data from, right?
I mean, yeah, you can use an LLM to create regular expression templates. That is by far not good enough to build automation solely on, though.
Just to clarify, please let me know if you are able to see the image. If yes, the column “Pattern Name” holds regexes for the error messages. Now, when I give it any error message, for instance “Combobox not found for label Executable Action in Transaction MIGO”, the LLM should be able to understand the semantics of the sentence and find a regular expression that matches it. It should then return the corresponding Category Name.
Just to clarify, I assume you are able to see the image that was attached in the previous post. If yes, then the column “Pattern Name” is a regular expression. Now, when I prompt my LLM asking for the category name for the error message “Combobox not found for label Executable Action in Transaction MIGO”, the model should be able to understand the semantics of the message, search for and match the regex, and then return the corresponding “Category Name” from the table.
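For comparison, the lookup described here can also be done deterministically in code, with the LLM only used afterwards to phrase an answer from the matched row. This is a sketch of that alternative, not something proposed in the thread; the table rows and category names are hypothetical, and only the example error message comes from the post above.

```python
import re

# Hypothetical table rows; the real patterns and categories live in the document.
TABLE = [
    {"Pattern Name": r"Combobox not found for label .+ in Transaction \w+",
     "Category Name": "UI element not found"},
    {"Pattern Name": r"Timeout after \d+ seconds",
     "Category Name": "Timeout"},
]


def find_category(message, table):
    """Return the Category Name of the first row whose regex matches the message."""
    for row in table:
        try:
            if re.search(row["Pattern Name"], message):
                return row["Category Name"]
        except re.error:
            # Skip rows whose pattern is not a valid Python regex.
            continue
    return None


message = "Combobox not found for label Executable Action in Transaction MIGO"
print(find_category(message, TABLE))
```

The matched row (category, recommendation, and so on) could then be placed in the prompt so the chatbot answers from the exact table entry rather than guessing which regex applies.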
So you have some data, and when it matches a pattern in your table you want to classify the data based on that?
Those are error messages in SAP and you want to build a classifier for them using AI?
- extract data from that document? : To answer this, the model should be able to match the given input error message against the regular expressions. After that pattern match, the model should be able to retrieve the respective recommendation for that category name. Could you please let me know if the image is visible to you?
- isn’t it a document that describes what you should do? : This document is the source that the LLM has to retrieve from. Not sure if I have answered your questions fully, though.
Let me know if you need more clarification.
Not a classifier, just a chatbot that can help answer my questions about the error messages.
I just need to know how I can pass the tabular data to the LLM. Is there any library or transformer that I can make use of?
Yeah, then my first answer is right. Spatial…
Which means you run the document through OCR with hOCR as the output format. Then look at that and tell me if it is usable.
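To judge whether the hOCR is usable, it can help to pull out the word boxes and group them into rows by vertical position. The sketch below assumes Tesseract-style hOCR, where each word is a `span` with class `ocrx_word` and a `title` attribute containing `bbox x0 y0 x1 y1`. The helper names and the 10-pixel row tolerance are my own assumptions for illustration.

```python
import re
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWords(HTMLParser):
    """Collect (bbox, text) pairs for every ocrx_word span in an hOCR page."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None   # bbox of the word currently being read, if any
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if "ocrx_word" in (attrs.get("class") or "").split():
            match = BBOX_RE.search(attrs.get("title") or "")
            if match:
                self._bbox = tuple(int(g) for g in match.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            text = "".join(self._text).strip()
            if text:
                self.words.append((self._bbox, text))
            self._bbox = None


def parse_hocr_words(hocr):
    parser = HocrWords()
    parser.feed(hocr)
    return parser.words


def group_rows(words, y_tolerance=10):
    """Group words into rows by vertical centre, then sort each row left to right."""
    rows = []
    for bbox, text in sorted(words, key=lambda w: (w[0][1] + w[0][3]) / 2):
        y_center = (bbox[1] + bbox[3]) / 2
        if rows and abs(y_center - rows[-1]["y"]) <= y_tolerance:
            rows[-1]["words"].append((bbox, text))
        else:
            rows.append({"y": y_center, "words": [(bbox, text)]})
    return [[t for _, t in sorted(r["words"], key=lambda w: w[0][0])] for r in rows]
```

If the rows that come out line up with the rows of the table, the same x-coordinates can be used to split each row into columns. If they do not, the scan resolution or the tolerance probably needs adjusting first.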