Hello,
I have a question: I have a PDF of 2396 pages. I will extract the text using PyMuPDF, and I want to replace some words in it, such as titles, and join the broken sentences.
Suppose a title is “The way to get lock”; it should automatically be recognized and wrapped in an h1 tag.
The question is how I can send the whole text to GPT and receive formatted text back.
The model is not a restriction; I can use any model provided by OpenAI.
Is this only feasible with a Document QA bot built with Langchain, which is what I found in almost every article I’ve read so far?
To start, it’s crucial to understand that when you open a PDF with a text editor, you often won’t find the document’s text. Even searching for specific words within the text might fail. This is due to the PDF’s complex internal structure, making text extraction and find-and-replace operations less straightforward than one might expect.
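One common reason for this is that page content is often stored in compressed streams. The sketch below only illustrates that effect with Python's standard `zlib` module; the content stream is made up for the example and is not taken from any real PDF.

```python
import zlib

# A made-up content stream: "Tj" is the PDF operator that shows a string.
title_line = b"BT /F1 24 Tf 72 720 Td (The way to get lock) Tj ET\n"
body_line = b"BT /F1 12 Tf 72 700 Td (Some body text follows the title.) Tj ET\n"
content_stream = title_line + body_line * 20

# Many PDFs store such streams deflate-compressed (the FlateDecode filter).
compressed = zlib.compress(content_stream)

# Searching the stored bytes for the title fails...
found_in_raw = b"The way to get lock" in compressed

# ...but succeeds once the stream has been decompressed.
found_after_decode = b"The way to get lock" in zlib.decompress(compressed)

print(found_in_raw, found_after_decode)
```

Compression is only one obstacle; custom font encodings can hide text from a plain search even in an uncompressed stream.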
Many bots use Langchain because AI struggles to extract information directly from PDFs, given their intricate internal structure. They therefore rely on software to perform the text extraction. However, this software might not preserve the metadata required to identify a header accurately.
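Font size is one piece of metadata that can help identify a header when the extractor keeps it. As far as I know, PyMuPDF's `page.get_text("dict")` reports a `size` for each text span. The sketch below does not call PyMuPDF; it works on hypothetical `(text, font_size)` pairs, and both the `tag_headings` helper and its `ratio` threshold are assumptions made for illustration.

```python
from collections import Counter
from html import escape


def tag_headings(spans, ratio=1.3):
    """Wrap spans whose font size is clearly above the body size in <h1>.

    `spans` is a list of (text, font_size) pairs; `ratio` is an arbitrary
    threshold chosen for illustration.
    """
    # Treat the most common font size as the body text size.
    body_size = Counter(size for _, size in spans).most_common(1)[0][0]
    html_parts = []
    for text, size in spans:
        tag = "h1" if size >= body_size * ratio else "p"
        html_parts.append(f"<{tag}>{escape(text)}</{tag}>")
    return html_parts


# Hypothetical spans, shaped like what a font-aware extractor could provide.
spans = [
    ("The way to get lock", 20.0),
    ("Body text of the first paragraph.", 11.0),
    ("More body text.", 11.0),
]
print("\n".join(tag_headings(spans)))
```

A rule like this needs no LLM call for titles that are set in a larger font, though it will miss headers that differ only by weight or position.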
Questions like these about PDFs are common. To fully grasp the reasons behind certain practices, it’s essential to have an understanding of the PDF’s internal workings.
If the process you are using generates broken sentences, then you might need something that does not, perhaps commercial software.
I have done some proof of concept code for extracting information from PDFs and have encountered sentences with additional information that, if considered solely as individual words, could disrupt the sentence structure. When parsing the raw PDF data and constructing sentences, there is often abundant contextual information available that makes it easier to form coherent sentences, thus no broken sentences.
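As a rough illustration of using positional context, the sketch below merges extracted lines into paragraphs based on the vertical gap between them. It is not the proof of concept code mentioned above; the `join_lines` helper, the `(y_top, height, text)` input shape, and the `max_gap` threshold are all assumptions made for the example.

```python
def join_lines(lines, max_gap=0.5):
    """Merge extracted lines into paragraphs.

    `lines` is a list of (y_top, height, text) tuples in reading order.
    A vertical gap larger than `max_gap` line heights starts a new paragraph.
    """
    paragraphs = []
    current = ""
    prev_bottom = None
    for y_top, height, text in lines:
        text = text.strip()
        if not text:
            continue
        gap_is_large = (
            prev_bottom is not None and (y_top - prev_bottom) > max_gap * height
        )
        if current and gap_is_large:
            paragraphs.append(current)
            current = ""
        if current.endswith("-"):
            # Heuristic: treat a trailing hyphen as a word split across lines.
            # This will wrongly merge genuinely hyphenated words.
            current = current[:-1] + text
        elif current:
            current += " " + text
        else:
            current = text
        prev_bottom = y_top + height
    if current:
        paragraphs.append(current)
    return paragraphs


# Hypothetical line data for illustration.
lines = [
    (100, 10, "When parsing the raw PDF data there is con-"),
    (112, 10, "textual information available."),
    (140, 10, "A new paragraph starts here."),
]
print(join_lines(lines))
```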
I’ve noticed NLTK mentioned before, and it might work. However, I have no prior experience with it.
The key question for you is whether the goal of achieving your desired level of precision is justified by the effort or cost required.
Okay, if I use the GPT API for that: is there a default value for new tokens, or how can I tell the GPT API to use as many tokens as it needs to generate the content?
Models with higher token limits also come with higher costs. Even if you use a model with a higher token limit, your PDF might still be too large for conversion.
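Whatever the limit is, a document this size has to be sent in pieces. The sketch below groups paragraphs into chunks under an approximate token budget. The `chunk_by_token_budget` helper is hypothetical, and the four-characters-per-token figure is only a rough rule of thumb for English text; an exact count would need the model's tokenizer.

```python
def chunk_by_token_budget(paragraphs, max_tokens=2000, chars_per_token=4):
    """Group paragraphs into chunks that stay under an approximate token budget.

    The budget is estimated from character counts, so it is not exact.
    A single paragraph longer than the budget becomes its own oversized chunk.
    """
    max_chars = max_tokens * chars_per_token
    chunks, current, current_len = [], [], 0
    for para in paragraphs:
        # Start a new chunk when adding this paragraph would exceed the budget.
        if current and current_len + len(para) > max_chars:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(para)
        current_len += len(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks


# Tiny illustrative input with a deliberately small budget.
paragraphs = ["a" * 30, "b" * 30, "c" * 30]
chunks = chunk_by_token_budget(paragraphs, max_tokens=15, chars_per_token=4)
print(len(chunks))
```

Keeping paragraph boundaries intact means no chunk starts or ends mid-sentence, which matters when the goal is to repair broken sentences rather than create new ones.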
It appears you have more tasks in mind beyond simply replacing words in the PDF and converting it to HTML. I’m currently uncertain about your specific needs.
If your sole objective is to convert a large PDF to HTML for viewing, you can refer to this post:
I’m familiar with this concept as traceability. I’ve encountered it in aviation software and medical knowledge as well.
As for PDF software, I’m not aware of any that currently offer a traceability feature. However, since I don’t purchase commercial software for PDFs, there might be options available.
If this functionality is a requirement, it may not be feasible to rely solely on an LLM due to potential inaccuracies or hallucinations. You might need Human-in-the-loop assistance.
How would you break this down into smaller steps to modify presentation, ensure accuracy and traceability? I’m asking to understand your thought process better and see where others or I can provide helpful insights. Currently, we’re operating at a high-level, which makes the feedback more subjective than objective.
I will get the PDF text and process 2 pages at a time. After all pages are processed, I will loop again to check whether any paragraph is broken and run some regular checks, but this time over pages (2,3), (4,5) rather than pages (1,2), (3,4).
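The two passes described above can be sketched as follows. The `page_pairs` helper is a hypothetical name, and page numbers are assumed to start at 1.

```python
def page_pairs(page_count, start=1):
    """Return (first, second) page-number pairs beginning at `start`."""
    return [(p, p + 1) for p in range(start, page_count, 2)]


first_pass = page_pairs(2396, start=1)   # (1, 2), (3, 4), ...
second_pass = page_pairs(2396, start=2)  # (2, 3), (4, 5), ...

print(first_pass[:2], second_pass[:2])
```

The second pass covers exactly the page boundaries the first pass skips, so a paragraph split between pages 2 and 3 is seen whole in the window (2, 3).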
How do you intend to identify where navigation links go?
How do you plan to store the links associated with the text?
How do you plan to present the pages?
If you edit the text after extraction and get an update to the PDF how do you sync up the changes?
What about images in the PDF?
Where will all of this run?
Will it be available to the public?
Sorry for so many questions. As I’ve mentioned to others, it’s essential to have a clear understanding of your ideas before you start coding or working on projects. While you can certainly create proof of concept components, it’s crucial to comprehend all aspects fully before beginning. Otherwise, you might encounter difficulties and end up wasting a significant amount of time or not even being able to complete the project.
You do understand that PDFs were designed primarily for presenting information in a static page format, typically for printing, rather than as a format for easy data extraction. This is why many people encounter difficulties when attempting to extract data from PDFs using software.
Question 1, 2:
For the navigation links, I will add an instruction to the prompt to assign a unique id based on the text the element contains. Since the words I want to replace aren’t long, that will be okay for me.
Using BS4, or JavaScript on the client, I will gather all elements that have ids and represent them with anchor tags and hash routing.
Question 3:
All the collected text and HTML will be only in one file.
I will append each piece of text to a list and later join the chunks.
Question 4:
I am looking at a cron job on AWS or another cloud computing platform, but for now it will run on a local PC.
Question 5:
Yep, the project is for the public, but only after all checks and tests.
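The id-gathering step from the answer to questions 1 and 2 could look roughly like this. The sketch uses Python's standard `html.parser` instead of BS4 only to stay dependency-free; the `IdCollector` class and `build_nav` function are hypothetical names, and the code assumes each element with an id contains plain text and no nested tags.

```python
from html import escape
from html.parser import HTMLParser


class IdCollector(HTMLParser):
    """Collect [id, text] for every element that has an id attribute."""

    def __init__(self):
        super().__init__()
        self.entries = []      # list of [id, text] pairs
        self._open_id = None   # id of the element currently being read

    def handle_starttag(self, tag, attrs):
        element_id = dict(attrs).get("id")
        if element_id:
            self._open_id = element_id
            self.entries.append([element_id, ""])

    def handle_data(self, data):
        if self._open_id is not None:
            self.entries[-1][1] += data

    def handle_endtag(self, tag):
        # Assumes no nested tags inside an element that carries an id.
        self._open_id = None


def build_nav(html):
    """Return one hash link per element with an id, one link per line."""
    collector = IdCollector()
    collector.feed(html)
    collector.close()
    links = [
        f'<a href="#{escape(element_id, quote=True)}">{escape(text.strip())}</a>'
        for element_id, text in collector.entries
    ]
    return "\n".join(links)


# Hypothetical page fragment for illustration.
page = '<h1 id="the-way-to-get-lock">The way to get lock</h1><p>Body text.</p>'
nav = build_nav(page)
print(nav)
```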
If I understand correctly, you will have one HTML file containing all the text and associated HTML sourced from a 2396-page PDF file.
I have to ask, are you new to programming and handling something of this magnitude?
How large do you anticipate the file size to be?
Do you believe it can be displayed as an HTML page?
Sorry for the skepticism, but I find it challenging to consider such a large amount of information in a single file.
Why is this specific PDF so significant?
I have a large PDF file containing technical content that I’m processing for use with AI, think chat with PDF. While I have experience in extracting information from PDFs, I still face some significant challenges. Instead of using AI for extraction, I plan to use RAG or convert the data into semantic triples before embedding. It’s important to understand the task and the challenges involved. Accomplishing it correctly requires strong programming skills. While some parts can be done with free or commercial software, if you use it you may be at its developers’ mercy when you need something they don’t plan to do.
I reviewed some of my PDF files, and the largest one contains 23,124 pages. No, that’s not a typo. The file is just too large: viewing and navigating it in Adobe Acrobat Reader takes so long that I typically end up aborting the reader with a kill command. The same holds for a 3,786-page PDF.