I need to chunk a few PDF files for vectorization and to create a RAG (Retrieval-Augmented Generation) application. I’d like to use the same tools ChatGPT uses when I upload a document. Do we know the tool or library used by ChatGPT to chunk and convert PDF files?
Which part of it is unclear?
What kind of PDF?
Does it have a text layer that can be extracted?
Any images inside the PDFs that need to be extracted?
Do you want to use OCR on top of text-layer extraction?
Any limitations on the PDFs' number of pages?
Any requirements regarding the speed of data extraction?
I mean, there is langgraph, which has a very basic document loader… You may try that.
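If the loader meant here is the basic PDF loader from the LangChain ecosystem (it ships in the langchain-community package rather than in langgraph itself), a minimal sketch looks like this; "example.pdf" is a placeholder path:

```python
# Minimal sketch, assuming the loader meant above is the basic PDF
# loader from the langchain-community package; "example.pdf" is a
# placeholder path.
from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("example.pdf")
docs = loader.load()  # one Document per page, text layer only (no OCR)
print(len(docs), docs[0].page_content[:200])
```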
Thanks for your response, but my question is specifically about identifying the tools or libraries ChatGPT (or OpenAI) uses for chunking and processing PDFs.
Your questions about PDF structure (e.g., text layers, OCR, images) aren’t relevant to what I’m asking. I’m looking for the exact tools used by ChatGPT or close equivalents.
You shouldn’t… I just ran yet another test. Their data extraction, especially from PDFs, is really basic: on my test PDF with 117 pages it got only ~50% of the data extraction right.
And the PDF has a text layer, so there is no need for OCR. I would guess they use PyPDF2 for that, then.
But they don’t use multi-page spatial grouping… at least it looks that way.
[edit] And it is quite common that information blocks span multiple pages.
Let me give you an example:
page 1 of the PDF:
fruits
- apple
- grapes
…
page 2 of the PDF:
- a totally made-up name for a special fruit that is only produced by one guy who has never promoted it anywhere
- some other fruit
…
Then you upload the file and prompt:
"count the fruits in that document"
It won’t find the fruits on the second page correctly. It chunks the PDF per page only, and since page 2 has no headline declaring what the following list items are, it fails…
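This failure mode is easy to reproduce locally. Below is a minimal sketch in plain Python (the two "pages" and the fruit names are hypothetical stand-ins for extracted PDF text) showing how per-page chunks drop the "fruits" heading, and how a small character overlap between chunks carries that context onto the next page's items:

```python
# Minimal sketch of the cross-page chunking problem described above.
# The two "pages" are hypothetical stand-ins for extracted PDF text.
pages = [
    "fruits\n- apple\n- grapes",
    "- dragonmelon\n- some other fruit",  # no headline on page 2
]

# Naive per-page chunking: page 2's items lose their "fruits" context,
# so a retriever has no way to connect them to the fruit list.
per_page_chunks = list(pages)

# Overlapping chunks: prepend the tail of the previous page so the
# heading (or at least nearby context) travels with the continuation.
def overlapping_chunks(pages, overlap_chars=40):
    chunks = []
    for i, text in enumerate(pages):
        prefix = pages[i - 1][-overlap_chars:] if i > 0 else ""
        chunks.append((prefix + "\n" + text).strip())
    return chunks

for chunk in overlapping_chunks(pages):
    print(chunk, "\n---")
```

Real splitters (e.g., LangChain's RecursiveCharacterTextSplitter with its chunk_overlap parameter) apply the same idea with configurable overlap, though overlap alone still won't recover a heading that sits many pages back.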
The way ChatGPT’s Private GPT understands and responds based on uploaded PDFs is far above my expectations.
I just need to know what libraries they are using. I hope they are not proprietary; I wish they were open source.
You might want to check out this here if the PDF has no text layer (e.g., you take a photo of an invoice, embed that into a Word file, and then export it to PDF; good luck using PyPDF2 then):
Or fine-tune Tesseract and combine it with YOLO, OpenCV, or other tools of that kind (if you need more than just text)…
You need a good GPU for that (the release of the RTX 5x series comes in handy, I suppose) and CUDA support, or else that stuff might be really slow on your machine.
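For the no-text-layer case, a common baseline is to rasterize each page and OCR the rendered image. A minimal sketch, assuming the Tesseract binary plus the pytesseract and PyMuPDF packages are installed ("scanned.pdf" is a placeholder path):

```python
# Sketch: OCR fallback for PDFs without a text layer.
# Assumes Tesseract plus the pytesseract and PyMuPDF packages are
# installed; "scanned.pdf" is a placeholder path.
import io

import fitz  # PyMuPDF
import pytesseract
from PIL import Image

doc = fitz.open("scanned.pdf")
for page_index, page in enumerate(doc, start=1):
    # Render the page at 2x resolution; higher DPI helps OCR accuracy.
    pix = page.get_pixmap(matrix=fitz.Matrix(2, 2))
    image = Image.open(io.BytesIO(pix.tobytes("png")))
    text = pytesseract.image_to_string(image)
    print(f"--- page {page_index} ---\n{text}")
```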
Thank you @polepole. I like your approach: using a clever prompt to reverse-engineer ChatGPT.
Two follow-up questions:
Question 1: Which model, GPT-4o or o1?
Question 2: Can you copy-paste the prompts you used here?
Thank you!
I did not use the API for it.
I used ChatGPT-4o. On the o1 model we currently cannot upload document files, but we can upload image files.
Prompt:
You are tasked with processing a PDF document that contains a mix of text, images, and tables. The goal is to extract text and images from the PDF while preserving the layout and then combine everything into a single Word file. Follow the steps below:
Steps:

1. Extract Text:
   - Use PyMuPDF (fitz) to extract all the text from the PDF page by page.
   - Maintain the order of the text as it appears in the original PDF.
   - If any issues occur during text extraction (e.g., unreadable sections), log the error and continue with the rest of the pages.

2. Extract Images:
   - Identify all images on each page using PyMuPDF’s get_images method.
   - Extract each image using PyMuPDF’s extract_image method and save them as separate image files.
   - If an image extraction fails, log the error, skip the image, and proceed.

3. Combine Text and Images into a Word Document:
   - Create a Word document using a library such as python-docx.
   - Add the extracted text to the Word file sequentially.
   - Insert images into the Word file at their approximate original positions based on their vertical positions in the PDF:
     - Use PyMuPDF’s bounding box data (bbox) for images and text to determine the order.
     - If bounding box data is unavailable or causes layout issues, add content sequentially by page while maintaining readability.
   - Add page separators in the Word file to reflect the PDF’s page breaks (e.g., “— Page 1 —”).

4. Save the Word Document:
   - Save the Word file with all the extracted text and images included in the correct order.
Notes:
- Resize images to fit neatly within the Word file (e.g., width of ~4.5 inches).
- For pages with only images or where text overlaps with images, prioritize readability over exact layout replication.
- In case of errors with libraries or processing, log the issues and complete the task with the remaining data.
Deliverables:
- A Word file containing:
  - Extracted text and images in their original order and layout.
  - Page breaks matching those in the original PDF.
- A summary of any errors encountered during processing, if applicable.
If you are ready, I will provide my PDF file.
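For readers who want to run the same pipeline locally rather than inside ChatGPT's code interpreter, here is a minimal sketch of what the prompt describes, using PyMuPDF and python-docx. File paths are placeholders, and for simplicity it appends each page's images after that page's text instead of doing full bbox-based ordering:

```python
# A minimal sketch of the pipeline the prompt describes, using PyMuPDF
# (the `fitz` module) and python-docx. Paths and sizes are illustrative.
import logging

import fitz  # PyMuPDF
from docx import Document
from docx.shared import Inches

logging.basicConfig(level=logging.INFO)

def pdf_to_docx(pdf_path: str, docx_path: str) -> None:
    pdf = fitz.open(pdf_path)
    doc = Document()

    for page_index, page in enumerate(pdf, start=1):
        # Page separator mirroring the PDF's page breaks.
        doc.add_paragraph(f"--- Page {page_index} ---")

        # 1. Extract text page by page, logging failures and continuing.
        try:
            doc.add_paragraph(page.get_text("text"))
        except Exception as exc:
            logging.error("Text extraction failed on page %d: %s", page_index, exc)

        # 2. Extract images and append them after the page's text.
        for img in page.get_images(full=True):
            xref = img[0]
            try:
                info = pdf.extract_image(xref)
                img_path = f"page{page_index}_img{xref}.{info['ext']}"
                with open(img_path, "wb") as fh:
                    fh.write(info["image"])
                # Resize to fit neatly within the Word file (~4.5 in wide).
                doc.add_picture(img_path, width=Inches(4.5))
            except Exception as exc:
                logging.error("Image %d on page %d skipped: %s", xref, page_index, exc)

    # 3. Save the combined Word document.
    doc.save(docx_path)

pdf_to_docx("input.pdf", "output.docx")
```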
I uploaded a 2-page PDF from the o1 system card; this is how the PDF pages look:
This is the output as a Word file:
This is an effective prompt. Thank you!
In your first reply, you included three screenshots of prompt results. The prompt caused GPT-4 to produce code that revealed the Python library it used.
Is there any chance you can share the prompts that resulted in your first reply’s screenshots?
I did not use any special prompt for the three-screenshot output.
I just used the simple prompts below, only to show you which libraries are used in ChatGPT.
I just said:
“Use PyPDF2, and extract page 3 from the pdf file I uploaded”
“Use PyMuPDF (fitz), and extract page 4 from the pdf file I uploaded”
“Use pdfplumber, and extract page 5 from the pdf file I uploaded”
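Run locally, those three probes look roughly like this. A sketch only: "sample.pdf" is a placeholder, and note that all three libraries index pages from 0:

```python
# Sketch: extracting one page's text with each of the three libraries
# named above. "sample.pdf" is a placeholder; indices are 0-based.
from PyPDF2 import PdfReader
import pdfplumber
import fitz  # PyMuPDF

# PyPDF2: page 3 of the PDF (index 2)
reader = PdfReader("sample.pdf")
print(reader.pages[2].extract_text())

# PyMuPDF (fitz): page 4 (index 3)
doc = fitz.open("sample.pdf")
print(doc[3].get_text())

# pdfplumber: page 5 (index 4)
with pdfplumber.open("sample.pdf") as pdf:
    print(pdf.pages[4].extract_text())
```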
Also, you can ask ChatGPT to search the web to compare those three PDF libraries with this prompt:
Create a detailed comparison table for the Python libraries PyPDF2, pdfplumber, and PyMuPDF (fitz). The table should include the following columns: Feature/Library, PyPDF2, pdfplumber, PyMuPDF (fitz), and Supported Formats. Use green checkmark (✔️) for supported features and red cross (❌) for unsupported ones. The features to compare are:
Text Extraction
Scanned PDFs
Table Extraction
Image Extraction
Speed
PDF Manipulation
Ease of Use
Supported Formats (mention the specific file formats supported by each library)
Make sure to include a legend explaining the symbols and provide a brief note about each library's capabilities.
@polepole, your approach to solving problems using clever prompts was inspiring to me.
Do you recommend or know of any resources that provide training or guides on how to create smart prompts like you do?
Thanks for your kind words! @ptrader
You can learn from OpenAI Best Practices - Prompt Engineering
I use similar techniques in My GPTs, and I get help from them.