I need to chunk a few PDF files for vectorization and to create a RAG (Retrieval-Augmented Generation) application. I’d like to use the same tools ChatGPT uses when I upload a document. Do we know the tool or library used by ChatGPT to chunk and convert PDF files?
Which part of it is unclear?
What kind of PDF?
Does it have a text layer that can be extracted?
Any images inside the PDFs that need to be extracted?
Do you want to use OCR on top of text-layer extraction?
Any limitations on the PDFs' number of pages?
Any requirements regarding the speed of data extraction?
I mean, there is langgraph, which has a very basic document loader… You may try that.
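If the loader meant here is the basic PDF loader from the LangChain ecosystem (it ships in the langchain-community package rather than in langgraph itself), a minimal sketch looks like this; "example.pdf" is a placeholder path:

```python
# Minimal sketch, assuming the loader meant above is the basic PDF
# loader from the langchain-community package; "example.pdf" is a
# placeholder path.
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("example.pdf")
docs = loader.load()  # one Document per page, text layer only (no OCR)
print(len(docs), docs[0].page_content[:200])
```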
Thanks for your response, but my question is specifically about identifying the tools or libraries ChatGPT (or OpenAI) uses for chunking and processing PDFs.
Your questions about PDF structure (e.g., text layers, OCR, images) aren’t relevant to what I’m asking. I’m looking for the exact tools used by ChatGPT or close equivalents.
You shouldn’t… I just ran yet another test. Their data extraction, especially from PDFs, is really basic: on my test PDF with 117 pages it got only ~50% of the data extraction right.
And the PDF has a text layer, so there is no need for OCR. I would guess they use PyPDF2 for that, then.
But they don’t use multi-page spatial grouping… at least it looks that way.
[edit] And it is quite common that information blocks span multiple pages.
Let me give you an example:
page 1 of the PDF:
fruits
- apple
- grapes
…
page 2 of the PDF:
- a totally made-up name for a special fruit that is only produced by one guy who has never promoted it anywhere
- some other fruit
…
Then you upload the file and prompt:
"count the fruits in that document"
It won’t find the fruits on the second page correctly. It chunks the PDF per page only, and since page 2 has no headline declaring what the following list items are, it fails…
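This failure mode is easy to reproduce locally. Below is a minimal sketch in plain Python (the two "pages" and the fruit names are hypothetical stand-ins for extracted PDF text) showing how per-page chunks drop the "fruits" heading, and how a small character overlap between chunks carries that context onto the next page's items:

```python
# Minimal sketch of the cross-page chunking problem described above.
# The two "pages" are hypothetical stand-ins for extracted PDF text.
pages = [
    "fruits\n- apple\n- grapes",
    "- dragonmelon\n- some other fruit",  # no headline on page 2
]

# Naive per-page chunking: page 2's items lose their "fruits" context,
# so a retriever has no way to connect them to the fruit list.
per_page_chunks = list(pages)

# Overlapping chunks: prepend the tail of the previous page so the
# heading (or at least nearby context) travels with the continuation.
def overlapping_chunks(pages, overlap_chars=40):
    chunks = []
    for i, text in enumerate(pages):
        prefix = pages[i - 1][-overlap_chars:] if i > 0 else ""
        chunks.append((prefix + "\n" + text).strip())
    return chunks

for chunk in overlapping_chunks(pages):
    print(chunk, "\n---")
```

Real splitters (e.g., LangChain's RecursiveCharacterTextSplitter with its chunk_overlap parameter) apply the same idea with configurable overlap, though overlap alone still won't recover a heading that sits many pages back.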
The way ChatGPT’s Private GPT understands and responds based on uploaded PDFs is far above my expectations.
I just need to know what libraries they are using. I hope they are not proprietary; I wish they were open source.
You might want to check out this here if the PDF has no text layer (e.g., you take a photo of an invoice, embed that into a Word file, and then export it to PDF; good luck using PyPDF2 then):
Or fine-tune Tesseract and combine it with YOLO, OpenCV, or other tools of that kind (if you need more than just text)…
You need a good GPU for that (the release of the RTX 5x series comes in handy, I suppose) and CUDA support, or else that stuff might be really slow on your machine.
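For the no-text-layer case, a common baseline is to rasterize each page and OCR the rendered image. A minimal sketch, assuming the Tesseract binary plus the pytesseract and PyMuPDF packages are installed ("scanned.pdf" is a placeholder path):

```python
# Sketch: OCR fallback for PDFs without a text layer.
# Assumes Tesseract plus the pytesseract and PyMuPDF packages are
# installed; "scanned.pdf" is a placeholder path.
import io

import fitz  # PyMuPDF
import pytesseract
from PIL import Image

doc = fitz.open("scanned.pdf")
for page_index, page in enumerate(doc, start=1):
    # Render the page at 2x resolution; higher DPI helps OCR accuracy.
    pix = page.get_pixmap(matrix=fitz.Matrix(2, 2))
    image = Image.open(io.BytesIO(pix.tobytes("png")))
    text = pytesseract.image_to_string(image)
    print(f"--- page {page_index} ---\n{text}")
```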
Thank you @polepole. I like your approach: using a clever prompt to reverse-engineer ChatGPT.
Two follow-up questions:
Question 1: Which model, GPT-4o or o1?
Question 2: Can you copy-paste the prompts you used here?
Thank you!
I did not use the API for it.
I used ChatGPT-4o. On the o1 model we currently cannot upload document files, but we can upload image files.
Prompt:
You are tasked with processing a PDF document that contains a mix of text, images, and tables. The goal is to extract text and images from the PDF while preserving the layout and then combine everything into a single Word file. Follow the steps below:
Steps:

1. Extract Text:
   - Use PyMuPDF (fitz) to extract all the text from the PDF page by page.
   - Maintain the order of the text as it appears in the original PDF.
   - If any issues occur during text extraction (e.g., unreadable sections), log the error and continue with the rest of the pages.

2. Extract Images:
   - Identify all images on each page using PyMuPDF’s get_images method.
   - Extract each image using PyMuPDF’s extract_image method and save them as separate image files.
   - If an image extraction fails, log the error, skip the image, and proceed.

3. Combine Text and Images into a Word Document:
   - Create a Word document using a library such as python-docx.
   - Add the extracted text to the Word file sequentially.
   - Insert images into the Word file at their approximate original positions based on their vertical positions in the PDF:
     - Use PyMuPDF’s bounding box data (bbox) for images and text to determine the order.
     - If bounding box data is unavailable or causes layout issues, add content sequentially by page while maintaining readability.
   - Add page separators in the Word file to reflect the PDF’s page breaks (e.g., “— Page 1 —”).

4. Save the Word Document:
   - Save the Word file with all the extracted text and images included in the correct order.
Notes:
- Resize images to fit neatly within the Word file (e.g., width of ~4.5 inches).
- For pages with only images or where text overlaps with images, prioritize readability over exact layout replication.
- In case of errors with libraries or processing, log the issues and complete the task with the remaining data.
Deliverables:
- A Word file containing:
  - Extracted text and images in their original order and layout.
  - Page breaks matching those in the original PDF.
- A summary of any errors encountered during processing, if applicable.
If you are ready, I will provide my PDF file.
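For readers who want to run the same pipeline locally rather than inside ChatGPT's code interpreter, here is a minimal sketch of what the prompt describes, using PyMuPDF and python-docx. File paths are placeholders, and for simplicity it appends each page's images after that page's text instead of doing full bbox-based ordering:

```python
# A minimal sketch of the pipeline the prompt describes, using PyMuPDF
# (the `fitz` module) and python-docx. Paths and sizes are illustrative.
import logging

import fitz  # PyMuPDF
from docx import Document
from docx.shared import Inches

logging.basicConfig(level=logging.INFO)

def pdf_to_docx(pdf_path: str, docx_path: str) -> None:
    pdf = fitz.open(pdf_path)
    doc = Document()

    for page_index, page in enumerate(pdf, start=1):
        # Page separator mirroring the PDF's page breaks.
        doc.add_paragraph(f"--- Page {page_index} ---")

        # 1. Extract text page by page, logging failures and continuing.
        try:
            doc.add_paragraph(page.get_text("text"))
        except Exception as exc:
            logging.error("Text extraction failed on page %d: %s", page_index, exc)

        # 2. Extract images and append them after the page's text.
        for img in page.get_images(full=True):
            xref = img[0]
            try:
                info = pdf.extract_image(xref)
                img_path = f"page{page_index}_img{xref}.{info['ext']}"
                with open(img_path, "wb") as fh:
                    fh.write(info["image"])
                # Resize to fit neatly within the Word file (~4.5 in wide).
                doc.add_picture(img_path, width=Inches(4.5))
            except Exception as exc:
                logging.error("Image %d on page %d skipped: %s", xref, page_index, exc)

    # 3. Save the combined Word document.
    doc.save(docx_path)

pdf_to_docx("input.pdf", "output.docx")
```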
I uploaded a 2-page PDF from the o1 system card; this is how the PDF pages look:
This is the output as a Word file:
This is an effective prompt. Thank you!
In your first reply, you included three screenshots of prompt results. The prompt caused GPT-4 to produce code that revealed the Python library it used.
Is there any chance you can share the prompts that resulted in your first reply’s screenshots?
I did not use any special prompt for the three-screenshot output.
I just used the simple prompts below, only to show you which libraries are used in ChatGPT.
I just said:
“Use PyPDF2, and extract page 3 from the pdf file I uploaded”
“Use PyMuPDF (fitz), and extract page 4 from the pdf file I uploaded”
“Use pdfplumber, and extract page 5 from the pdf file I uploaded”
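Run locally, those three probes look roughly like this. A sketch only: "sample.pdf" is a placeholder, and note that all three libraries index pages from 0:

```python
# Sketch: extracting one page's text with each of the three libraries
# named above. "sample.pdf" is a placeholder; indices are 0-based.
from PyPDF2 import PdfReader
import pdfplumber
import fitz  # PyMuPDF

# PyPDF2: page 3 of the PDF (index 2)
reader = PdfReader("sample.pdf")
print(reader.pages[2].extract_text())

# PyMuPDF (fitz): page 4 (index 3)
doc = fitz.open("sample.pdf")
print(doc[3].get_text())

# pdfplumber: page 5 (index 4)
with pdfplumber.open("sample.pdf") as pdf:
    print(pdf.pages[4].extract_text())
```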
Also, you can ask ChatGPT to search the web to compare those three PDF libraries with this prompt:
Create a detailed comparison table for the Python libraries PyPDF2, pdfplumber, and PyMuPDF (fitz). The table should include the following columns: Feature/Library, PyPDF2, pdfplumber, PyMuPDF (fitz), and Supported Formats. Use green checkmark (✔️) for supported features and red cross (❌) for unsupported ones. The features to compare are:
Text Extraction
Scanned PDFs
Table Extraction
Image Extraction
Speed
PDF Manipulation
Ease of Use
Supported Formats (mention the specific file formats supported by each library)
Make sure to include a legend explaining the symbols and provide a brief note about each library's capabilities.
@polepole, your approach to solving problems using clever prompts was inspiring to me.
Do you recommend or know of any resources that provide training or guides on how to create smart prompts like you do?
Thanks for your kind words! @ptrader
You can learn from OpenAI Best Practices - Prompt Engineering
I use similar techniques in My GPTs, and I get help from them.