How Does One Extract Sequentially From a PDF

Hey everyone!

I’ve been trying to automate the conversion of a long PDF to a spreadsheet, and succeeded…as long as there is a prepopulated list of names for the search to use as an initial reference.

But try as I might, I cannot fully automate this first step of name retrieval.

The PDF has a semi-standardized format, but the only thing that indicates a new item is some whitespace and a slightly larger bold maroon font. Items can be of greatly varying length. Having two columns of data seems particularly challenging. :man_shrugging:

Mini can’t do it at all (though it’s top-notch at finding something if you already know the name). But ask it to “give me a list of names as you find them in this PDF” and all it can seem to find are “Giant [miscellaneous creatures that don’t exist].”

4o’s attempts are “okay,” with only a few errors in the list…but far from safe enough to run unattended. It usually goes off track after the first few entries, even when looking for as few as five items. :face_with_diagonal_mouth:

Tried Everything OpenAI Offers

I’ve exhausted every possibility OpenAI offers:

  • Basic Prompting via a CustomGPT
  • Basic Prompting via an Assistant on the Playground with Mini and 4o
  • An Assistant with only one document in a Vector Store.
  • An Assistant with the score_threshold changed.
  • A Vector Store with a smaller chunking strategy, which, in theory, should give greater overlap for the sections in question and help with the multiple columns.
  • Various permutations of temperature, top_p, and max_num_results (see the sketch below).
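
For reference, here’s roughly where each of those knobs lives in the Assistants v2 API. This is a sketch only, with a hypothetical file ID and illustrative values, not what my script actually does:

```python
from openai import OpenAI

client = OpenAI()

# Smaller chunks with proportionally large overlap (the two-column theory).
vs = client.beta.vector_stores.create(name="srd-appendix")
client.beta.vector_stores.files.create(
    vector_store_id=vs.id,
    file_id="file-abc123",  # hypothetical ID of the uploaded PDF
    chunking_strategy={
        "type": "static",
        "static": {"max_chunk_size_tokens": 400, "chunk_overlap_tokens": 200},
    },
)

# Assistant with the file_search ranking knobs and sampling settings.
assistant = client.beta.assistants.create(
    model="gpt-4o",
    temperature=0.2,
    top_p=0.9,
    tools=[{
        "type": "file_search",
        "file_search": {
            "max_num_results": 10,
            "ranking_options": {"ranker": "auto", "score_threshold": 0.6},
        },
    }],
    tool_resources={"file_search": {"vector_store_ids": [vs.id]}},
)
```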

Here’s the Python code and the source document if you wanna give it a try.

(More interestingly, the script shows analytics from the search and makes it easy to play with settings and see relevant results.)

Score Threshold is Useful Overall

The new ranking_options are helpful, but they don’t constrain the model to read a PDF as a human does.

  • Examining the run steps, then setting a higher score_threshold, DIDN’T help pull my list, but it DID significantly reduce prompt_tokens, which was very useful for reducing costs when 4o performed the search.
  • This made 4o competitive for performing the search; it was better at it from the start, but prohibitively expensive.
  • You can only set a floor for score_threshold. It yields “no results” even if there are results below the floor (which makes sense), but it would be helpful if one could set a range. As it is, you have to examine the search results first (as sketched after this list); otherwise there’s no way to be sure what scores the ranker is awarding.
  • In the future, it would be helpful to constrain a model to “reading a document as a human does,” if that’s even possible.
  • score_threshold is related to the model used to perform the search: 4o found chunks with a higher average score than mini did under identical circumstances.
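
Here’s roughly how one can pull those scores out of the run steps; the thread and run IDs are hypothetical:

```python
from openai import OpenAI

client = OpenAI()

# List the run steps with file_search results included, then print the
# score the ranker awarded to each retrieved chunk.
steps = client.beta.threads.runs.steps.list(
    thread_id="thread_abc123",  # hypothetical
    run_id="run_abc123",        # hypothetical
    include=["step_details.tool_calls[*].file_search.results[*].content"],
)
for step in steps.data:
    if step.step_details.type == "tool_calls":
        for call in step.step_details.tool_calls:
            if call.type == "file_search":
                for result in call.file_search.results:
                    print(result.file_name, result.score)
```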

It was possible to reduce the number of search results without affecting their quality (which wasn’t great to begin with). That seems to indicate that the results I’m looking for are successfully being pulled in the first chunk.

Any Ideas What’s Next? :thinking: :thinking:

When it comes to automating data extraction from an irregular PDF to a CSV, I’m at the point where I think it’s most valuable to advise folks to create this initial names list manually, or with close supervision.

I think it would be several times faster to read the names off the list while ChatGPT listened than to go through all of this. The goal is only to automate this process “as much as possible,” and there is something to be said for deliberately including human oversight at this stage.

That said, my next step is to create a small RAG flow and have an Assistant look at the current results, then send them back through for correction, and see if that helps the final results. score_threshold really does help minimize costs.

I am surprised this step is so difficult.

Any thoughts or insights are most welcome!


So I was experimenting with the excellent post by @dlaytonj2 on the Batch API, here: Fun with the Batch API - An example

And decided to give your first page a go with text recognition from gpt-4o-mini in batch. I can successfully get the multi-column thing to read properly.

**Appendix MM-A: Miscellaneous Creatures** 

This appendix contains statistics for various animals, vermin, and other critters. The stat blocks are organized alphabetically by creature name.

### Ape

- **Medium beast, unaligned**  
- **Armor Class:** 12  
- **Hit Points:** 19 (3d8 + 6)  
- **Speed:** 30 ft., climb 30 ft.  

**STR** | **DEX** | **CON** | **INT** | **WIS** | **CHA**  
16 (+3) | 14 (+2) | 14 (+2) | 6 (−2) | 12 (+1) | 7 (−2)  

**Skills:** Athletics +5, Perception +3  
**Languages:** —  
**Challenge:** 1/2 (100 XP)  

**Actions:**  
- **Multiattack:** The ape makes two fist attacks.  
- **Fist. Melee Weapon Attack:** +5 to hit, reach 5 ft., one target. Hit: 1d6 + 3 bludgeoning damage.  
- **Rock. Ranged Weapon Attack:** +5 to hit, range 25/50 ft., one target. Hit: 6 (1d6 + 3) bludgeoning damage.  

---

### Awakened Shrub

- **Small plant, unaligned**  
- **Armor Class:** 9  
- **Hit Points:** 10 (3d6)  
- **Speed:** 20 ft.  

**STR** | **DEX** | **CON** | **INT** | **WIS** | **CHA**  
3 (−4) | 8 (−1) | 11 (+0) | 10 (+0) | 10 (+0) | 6 (−2)  

**Damage Vulnerabilities:** Fire  
**Damage Resistances:** Piercing  
**Senses:** Passive Perception 10  
**Languages:** One language known by its creator  
**Challenge:** 0 (10 XP)  

**Actions:**  
- **Rake. Melee Weapon Attack:** +1 to hit, reach 5 ft., one target. Hit: 1 (1d4 − 1) slashing damage.  

An awakened shrub is an ordinary shrub given sentience and mobility by the **awaken** spell or similar magic.

---

### Awakened Tree

- **Huge plant, unaligned**  
- **Armor Class:** 13 (natural armor)  
- **Hit Points:** 59 (7d12 + 14)  
- **Speed:** 20 ft.  

**STR** | **DEX** | **CON** | **INT** | **WIS** | **CHA**  
19 (+4) | 6 (−2) | 15 (+2) | 10 (+0) | 10 (+0) | 7 (−2)  

**Damage Vulnerabilities:** Fire  
**Damage Resistances:** Bludgeoning, piercing  
**Senses:** Passive Perception 10  
**Languages:** One language known by its creator  
**Challenge:** 2 (450 XP)  

**Actions:**  
- **Slam. Melee Weapon Attack:** +6 to hit, reach 10 ft., one target. Hit: 14 (3d6 + 4) bludgeoning damage.  

An awakened tree is an ordinary tree given sentience and mobility by the **awaken** spell or similar magic.

---

### Axe Beak

- **Large beast, unaligned**  
- **Armor Class:** 11  
- **Hit Points:** 19 (3d10 + 3)  
- **Speed:** 50 ft.

The key insight was to provide the format in the system_context and the user_context:

```python
SYSTEM_IMAGE_READER_CONTEXT = "You are an expert at reading text in the image."
USER_IMAGE_READER_CONTEXT = (
    "The format is structured in multiple columns. "
    "Obviously the text must follow as a human would read it."
)
```
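
For reference, here’s a synchronous sketch of how those two contexts get wired into an image request. The actual run went through the Batch API, so treat this chat-completions version as illustrative only:

```python
import base64
from openai import OpenAI

client = OpenAI()

def read_page(png_path: str) -> str:
    """Send one page image through chat completions with the contexts above."""
    with open(png_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SYSTEM_IMAGE_READER_CONTEXT},
            {"role": "user", "content": [
                {"type": "text", "text": USER_IMAGE_READER_CONTEXT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ]},
        ],
    )
    return resp.choices[0].message.content
```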

Can you please take a quick look and let me know if the raw text extraction looks OK?


I got tired of waiting for the batch to complete, so I extracted through chat completions instead.

The methodology went like this:

(a) Separated each page of the PDF, saved as PNG (pymupdf)
  a.1 Extracted text out of each page ← LLM
  a.2 Extracted creature names out of each page ← LLM

(b)
  b.1 Compiled a total list of all creatures extracted in a.2 ← Python
  b.2 Removed the footer from each page’s text ← Python
  b.3 Combined all pages to produce the entire appendix in text ← Python
  b.4 Removed hallucinated values from the creatures list ← Python
  b.5 Corrected spelling on the creatures list ← Python

(c) Extracted the text for each creature ← Python

(d) Mapped text to structured output ← LLM
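
A minimal sketch of step (a), assuming PyMuPDF and a hypothetical filename:

```python
import fitz  # pip install pymupdf

# Render each PDF page to a PNG for the two LLM passes (a.1 and a.2).
doc = fitz.open("appendix_mm.pdf")  # hypothetical filename
for i, page in enumerate(doc):
    pix = page.get_pixmap(dpi=150)  # higher dpi = sharper text, bigger file
    pix.save(f"page_{i:02d}.png")
doc.close()
```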

Sample for (c)



Quipper  
Tiny beast, unaligned  
Armor Class 13  
Hit Points 1 (1d4 - 1)  
Speed 0 ft., swim 40 ft.  
STR -2 (-4)  
DEX 16 (+3)  
CON 9 (-1)  
INT 1 (-5)  
WIS 7 (-2)  
CHA 2 (-4)  
Senses darkvision 60 ft., passive Perception 8  
Languages —  
Challenge 0 (10 XP)  

Blood Frenzy. The quipper has advantage on melee attack rolls against any creature that doesn’t have all its hit points.  
Water Breathing. The quipper can breathe only underwater.  

Actions  
Bite. Melee Weapon Attack: +5 to hit, reach 5 ft., one target. Hit: 1 piercing damage.  

A quipper is a carnivorous fish with sharp teeth. Quippers can adapt to any aquatic environment, including cold subterranean lakes. They frequently gather in swarms; the statistics for a swarm of quippers appear later in this appendix.  

Sample for (d)

name='Quipper' description='A quipper is a carnivorous fish with sharp teeth. Quippers can adapt to any aquatic environment, including cold subterranean lakes. They frequently gather in swarms; the statistics for a swarm of quippers appear later in this appendix.' armor_class=['13'] hit_points='1 (1d4 - 1)' speed=Speed(walk='0 ft.', fly=None, swim='40 ft.', climb=None) ability_scores=['STR -2 (-4)', 'DEX 16 (+3)', 'CON 9 (-1)', 'INT 1 (-5)', 'WIS 7 (-2)', 'CHA 2 (-4)'] senses=Senses(blindsight=None, darkvision='60 ft.', passive_perception=8) skills=None languages=None challenge_rating='0' experience_points=10 abilities=['Blood Frenzy. The quipper has advantage on melee attack rolls against any creature that doesn’t have all its hit points.', 'Water Breathing. The quipper can breathe only underwater.'] actions=['Bite. Melee Weapon Attack: +5 to hit, reach 5 ft., one target. Hit: 1 piercing damage.'] lore=None
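
If you’re curious, the field names in that repr imply Pydantic models roughly like these (reconstructed from the sample output above, so treat the exact types as guesses):

```python
from typing import List, Optional
from pydantic import BaseModel

class Speed(BaseModel):
    walk: Optional[str] = None
    fly: Optional[str] = None
    swim: Optional[str] = None
    climb: Optional[str] = None

class Senses(BaseModel):
    blindsight: Optional[str] = None
    darkvision: Optional[str] = None
    passive_perception: Optional[int] = None

class Creature(BaseModel):
    name: str
    description: str
    armor_class: List[str]
    hit_points: str
    speed: Speed
    ability_scores: List[str]
    senses: Senses
    skills: Optional[List[str]] = None
    languages: Optional[List[str]] = None
    challenge_rating: str
    experience_points: int
    abilities: List[str]
    actions: List[str]
    lore: Optional[str] = None

# A Creature instance like the repr above comes back from a structured-output
# call, e.g. client.beta.chat.completions.parse(..., response_format=Creature).
```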

hahahaha
I apologize for my own delay. I had to spend some time digesting what you said.

So here, you programmatically split the PDF into all of its individual pages.

The approach makes sense given that the biggest difficulty is getting an Assistant to “turn the page,” but wouldn’t it take a lot of storage for long documents?

Or is the lesson here to always store data in the smallest possible size?

Can you please tell me more about this? How can you be sure that you’re removing hallucinated values in b.4?

The biggest problem I’ve faced is the percentage of hallucinated values created by the methods I’ve been trying. I couldn’t figure out how to remove hallucinations with any degree of certainty.

Storage is relatively cheap. On average, one page was 90 KB (29 pages in total). So it’s doable for me even at two orders of magnitude more.

Not really sure what the lesson I learnt here was. But as you will see later in this post, for me the smallest possible size was the entire document, because of the overlap of content between pages.

I do two passes over all pages: the first to extract the entire text and the second to extract only the creature names. In the text, each creature name appears on a single line (as is to be expected).

  • Concatenating all the pages of text produces the entire appendix in text format.
  • Concatenating all the creatures from all the pages produces the list of creatures (some hallucinated).
  • With the appendix, the sorted set of creatures, the approximate count of when to encounter the next creature, and a Levenshtein-distance check to make sure there isn’t a spelling mismatch between the text and a creature name (happened once), I can be reasonably assured that I am removing any hallucinated values (~5) and correcting spelling mistakes (+1). A sketch of that validation idea follows.
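
A minimal sketch, using difflib from the standard library as a stand-in for a dedicated Levenshtein package; the filename and candidate names are hypothetical:

```python
from difflib import SequenceMatcher

# Hypothetical inputs: the concatenated appendix text and the raw LLM name list.
appendix_lines = [l.strip() for l in open("appendix.txt") if l.strip()]
candidates = ["Ape", "Axe Beek", "Giant Sparrow"]  # one typo, one hallucination

def similar(a: str, b: str, threshold: float = 0.85) -> bool:
    # Ratio-based stand-in for Levenshtein distance.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

validated = []
for name in candidates:
    if name in appendix_lines:
        validated.append(name)        # exact hit: keep as-is
        continue
    near = [line for line in appendix_lines if similar(name, line)]
    if near:
        validated.append(near[0])     # near hit: take the spelling from the text
    # no hit at all: treat the name as hallucinated and drop it
```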

hth


I too have been working on getting scanned PDFs into CSV, and I found that saving them in MS Word format is the way to go because any cleanup can be automated. The PDFs I’m working with were scanned from a 19th-century series of books, each 400 pages long, so there are a lot of OCR artifacts, especially with numbers and italic text. Once you get the content into Word as plain text, extracting names to CSV is a piece of cake. Sometimes old school is the best way to go.


Welcome, @mchip!

There were nuances in the appendix which went beyond OCR. One particular nuance was the fact that this was a multi-column (two-column) text. Traditional OCR, in such settings, would read one line at a time, which, of course, destroys the entire structure of the page.

I am finding, in a lot of other cases as well, that intermixing traditional old school with LLMs works well for tasks which were frankly impossible or extremely hard to do before.


Yes, I am finding the same thing as well. Intermixing works really, really well, especially when you are intimately familiar with the software one used to use to achieve the same result. I was rusty working with Word and Excel, and voila, ChatGPT to the rescue. It wrote really dreadful VBA code, but it got me back to the point where I could correct and debug it easily enough. That would have been next to impossible without prior knowledge.

One thing you should check out if you need to programmatically manipulate global search/replace functions is the Dictionary object. You create a range of search/replace pairs in Excel and then read them into a Dictionary object in Word VBA to instantly search/replace any number of text strings. My replace spreadsheet is already over 700 rows; I keep adding to it as I process more PDFs and find more boo-boos. You could do the same thing with RegEx, but I found plain text in Excel easier to read and debug.
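
If you’d rather stay out of VBA, the same dictionary idea sketches out in Python with python-docx; the filenames are hypothetical, and the VBA version is what I actually use:

```python
import csv
from docx import Document  # pip install python-docx

# Load search/replace pairs exported from the spreadsheet.
with open("replacements.csv", newline="", encoding="utf-8") as f:
    pairs = list(csv.reader(f))  # each row: [search, replace]

doc = Document("scanned_volume.docx")
for para in doc.paragraphs:
    for run in para.runs:
        for bad, good in pairs:
            # Naive per-run replace; misses strings split across runs,
            # which is usually fine for plain-text OCR cleanup.
            run.text = run.text.replace(bad, good)
doc.save("scanned_volume_clean.docx")
```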

Apologies, I meant the lesson that I learnt. Imma noob.

Based on y’all’s thoughts I’ve changed tack and begun experimenting with initial file conversion, removing AI from the equation.

Since @icdev2dev has a method using PNG, I took a different direction.

I conversed with o1-mini a bit on what the better option would be, and it suggested docx was ideal (over RTF) because of the metadata sent along with stylistic information. But, given what @icdev2dev said about accuracy, I was concerned about said conversion and tried it twice.

Check this out.

Using pdf2docx

I had o1-mini write me a short script to do the conversion. I’m probably missing things, but this is definitely not usable.

Here's the script:

```python
from pdf2docx import Converter
import os

def convert_pdf_to_docx(pdf_path, docx_path):
    """
    Converts a PDF file to a DOCX file.

    :param pdf_path: Path to the input PDF file.
    :param docx_path: Path where the output DOCX file will be saved.
    """
    if not os.path.exists(pdf_path):
        print(f"Error: The file {pdf_path} does not exist.")
        return

    try:
        # Initialize the Converter
        cv = Converter(pdf_path)
        # Convert the PDF to DOCX
        cv.convert(docx_path, start=0, end=None)
        cv.close()
        print(f"Conversion successful! DOCX file saved at: {docx_path}")
    except Exception as e:
        print(f"An error occurred during conversion: {e}")

def main():
    print("PDF to DOCX Converter using pdf2docx")
    pdf_path = input("Enter the path to the PDF file: ").strip('"').strip("'")
    docx_path = input("Enter the desired path for the DOCX file: ").strip('"').strip("'")

    # Ensure the output path has a .docx extension
    if not docx_path.lower().endswith('.docx'):
        docx_path += '.docx'

    convert_pdf_to_docx(pdf_path, docx_path)

if __name__ == "__main__":
    main()
```

Using Native Adobe Acrobat Tools

I never noticed the “convert” tab in Acrobat before this, so I gave it a whirl.

:open_mouth:
This basically did everything I wanted immediately. Look how it copied the original format 1-for-1. Pretty neat.

It actually automatically converted the names into headings, so all we have to do is remove the text in between headings to get the final list of names. And it should be 100% accurate.

Adobe has a whole library of SDKs for this, so I think the process can be automated with a high degree of trust in the result.

I think you’d still have to actually look at the document after conversion with this. I’m actually not sure how you could do these conversions at all without human eyes that understand the document at key moments in the process.

Oh definitely. It takes more than one pass to proof the docs after the automated cleanup. It would be impossible to do this without human editing; there is no technology that can fix gibberish. In my case the OCR conversion from PDF to Word is full of weird interpretations such as $ for S, O for 0, 4 is sometimes th, “bis” for “his”, etc., so I’ve had to download two different PDFs that were scanned at different times (the print docs date from the 1890s and are translations from the Dutch; they are held in various libraries where they were scanned at different times, usually more than a decade ago). Sometimes entire paragraphs look like the text below, so they have to be proofed against a printed copy.

I’m up to my fifth volume, and so far I’ve been able to find legible scanned pages by juggling the two different scans. It’s a fun project: pulling the Minutes of the Court of New Amsterdam from 1653 to 1674 out of dusty oblivion. When I’m done, it will be the first time in 400 years that people will be able to see (and search) them in their entirety.
. : " Leeoden Dirciizct A’at Vaiit. pit:, vj £igmisn !".


:clap: Sick.

Can you give the Adobe conversion a whirl? I’m curious if it will do any better with your document.

My original document (the D&D SRD) was probably made in either Word or InDesign initially, so it’s not exactly unexpected that the Adobe API can convert it so well.

I don’t have time myself, but give it a whirl. I tried to reply earlier but was told I can’t include hyperlinks; it’s on the Internet Archive, so just replace the - characters with the appropriate / character.
archive.org-details-recordsnewamste04ygoog-page-n6-mode-2up
It’s vol 5, which includes 1664 when Peter Stuyvesant had to surrender the place to the English. You’ll need to trim several pages from the beginning and the end.

Working Solution!

Thanks much! @icdev2dev taking it back a step from AI, and @mchip looking into Word (docx), led to the solution.

I got it working in three steps. AI did most of the code-writing.

  1. Convert the .pdf into a .docx using the Adobe Extractor API and their SDK.
  2. Visually identify the heading type.
  3. Programmatically remove everything but the headings (using the docx SDK), then add them all to a CSV, as sketched below.
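
Here’s a minimal sketch of step 3 with python-docx, assuming the conversion tagged the names as “Heading 2” (the style name and filenames are hypothetical):

```python
import csv
from docx import Document  # pip install python-docx

# Keep only heading paragraphs, then number them into the CSV.
doc = Document("misc_monsters.docx")  # hypothetical output of step 1
names = [p.text.strip() for p in doc.paragraphs
         if p.style.name == "Heading 2" and p.text.strip()]

with open("misc_monsters.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Unique Identifier", "Heading"])
    for i, name in enumerate(names, start=1):
        writer.writerow([f"dndgpt_misc_monsters_{i:04d}", name])
```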
Results
Unique Identifier,Heading
dndgpt_misc_monsters_0001,Ape
dndgpt_misc_monsters_0002,Awakened Shrub
dndgpt_misc_monsters_0003,Awakened Tree
dndgpt_misc_monsters_0004,Axe Beak
dndgpt_misc_monsters_0005,Baboon
dndgpt_misc_monsters_0006,Bat
dndgpt_misc_monsters_0007,Badger
dndgpt_misc_monsters_0008,Black Bear
dndgpt_misc_monsters_0009,Blink Dog
dndgpt_misc_monsters_0010,Blood Hawk
dndgpt_misc_monsters_0011,Boar
dndgpt_misc_monsters_0012,Brown Bear
dndgpt_misc_monsters_0013,Camel
dndgpt_misc_monsters_0014,Cat
dndgpt_misc_monsters_0015,Constrictor Snake
dndgpt_misc_monsters_0016,Crab
dndgpt_misc_monsters_0017,Crocodile
dndgpt_misc_monsters_0018,Death Dog
dndgpt_misc_monsters_0019,Deer
dndgpt_misc_monsters_0020,Dire Wolf
dndgpt_misc_monsters_0021,Draft Horse
dndgpt_misc_monsters_0022,Eagle
dndgpt_misc_monsters_0023,Elephant
dndgpt_misc_monsters_0024,Elk
dndgpt_misc_monsters_0025,Flying Snake
dndgpt_misc_monsters_0026,Frog
dndgpt_misc_monsters_0027,Giant Ape
dndgpt_misc_monsters_0028,Giant Badger
dndgpt_misc_monsters_0029,Giant Bat
dndgpt_misc_monsters_0030,Giant Boar
dndgpt_misc_monsters_0031,Giant Centipede
dndgpt_misc_monsters_0032,Giant Constrictor Snake
dndgpt_misc_monsters_0033,Giant Crab

I got the name lists I needed, 100% hallucination-free (lol). (The spell list is 320 items long.) But there are still some tweaks needed to deal with the various formats.

I think this identifying step can be handled by AI in order to make this a more general-use technique.

Convert the document, then have a model (or a fine-tuned vision model) look at a few pages of the document to determine the internal hierarchy, then return said hierarchy in a structured response or a tool call.
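
A rough sketch of what that could look like with Structured Outputs; the model choice, page image, and schema are all assumptions:

```python
import base64
from openai import OpenAI
from pydantic import BaseModel

class DocHierarchy(BaseModel):
    item_heading_style: str  # e.g. "Heading 2" marks each new item
    notes: str               # the model's reasoning about the layout

client = OpenAI()
with open("page_00.png", "rb") as f:  # hypothetical sample page
    b64 = base64.b64encode(f.read()).decode()

completion = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",  # a model that supports Structured Outputs
    messages=[
        {"role": "system",
         "content": "Identify which heading style marks a new item in this document."},
        {"role": "user", "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ]},
    ],
    response_format=DocHierarchy,
)
hierarchy = completion.choices[0].message.parsed
```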


But there are still some tweaks needed to deal with the various formats.

If you’re in Word, use the built-in styles to control formatting. Normal (or Body Text) is the built-in style for paragraph formatting, and there’s a check box to automatically update it, which controls alignment, tabs, character formatting, etc. Heading 1 is the default for headings, useful for outlining. Turn on the display for all formatting characters except spaces. To convert to CSV, use tabs, not CRLF/paragraphs, to delimit rows.

Word is very powerful, but settings are buried in the ribbon and property dialogs; the UI is non-intuitive and F1 help is often unclear unless you are familiar with the terminology. I’ve been working intermittently with Word since it was ported from DOS, and I often can’t find some feature because they moved it around from one version to the next.

Oh yes, I am very familiar with Word.

That’s why Adobe’s API is impressive to me: they transfer over all of that rich data. The weird initial results from pdf2docx were more what I was expecting.

Given that these styles are readable, what I meant was that I think one can write a script where an agent pops in, reads said format, then makes a judgment call based on user input.

That list of hallucination-less headings is a powerful tool.