Search long pdf for specific table - possibly need fine tuning model

I have many long (100+ page) pdfs.
There is 1-2 specific tables I need to grab some each pdf although the page in which they are on is never the same. I am interested in the executive and non-executive director compensation/renumeration tables
Not only that but the format is always different (obviously not entirely different, but enough to not be able to write easy code to grab it)

I have tried to use a python pdf to text package and then openai api to find the table but have not got anywhere meaningful. Perhaps I can use a fine tuning model to show open ai what I am looking for? Any suggestions?

example pdf: https://cdn-api.markitdigital.com/apiman-gateway/ASX/asx-research/1.0/file/2924-02701314-2A1468465?access_token=83ff96335c2d45a094df02a206a39ff4

The 2 tables i am interested in from this example:


Here is my code atm:
I have new code and have posted it in reply to this

Okay I have made progress, but still looking for help.

This is my most recent code:

import PyPDF2
import re
import openai
import os

# Function to count numbers in a string
def count_numbers(text):
    return len(re.findall(r'\b\d+\b', text))

# Find likely pages containing the table
likely_pages = []
with open('SHA_full.pdf', 'rb') as file:
    pdf_reader = PyPDF2.PdfReader(file)
    num_pages = len(pdf_reader.pages)

    for i in range(num_pages):
        page = pdf_reader.pages[i]
        text = page.extract_text()

        if any(keyword.lower() in text.lower() for keyword in ['key management personnel', 'KMP', 'salary']):
            num_count = count_numbers(text)
            if num_count > 30 and '$' in text:
                likely_pages.append((i + 1, num_count))

# Sort likely_pages by the number count and limit to top 5
likely_pages = sorted(likely_pages, key=lambda x: x[1], reverse=True)[:5]
likely_pages = [x[0] for x in likely_pages]               

# Initialize OpenAI API
openai.api_key = "sk-XXX"  # Replace with your actual OpenAI API key
my_ai_model = "gpt-3.5-turbo"

pdf_file = "SHA_full.pdf"  # Replace with your actual PDF file

# Function to process the text through AI
def aiprocessor(page_no, text):
    print(f"\n\n..AI processing page {page_no}")
    messages = [
        {
            "role": "system",
            "content": """You are a PDF table extractor, a backend processor.
- User input is messy raw text extracted from a PDF page by PyPDF2.
- Do not output any body text, we are only interested in tables.
- The goal is to identify tabular data, and reproduce it cleanly a table.
- Look for the table that shows the remuneration for the executives and the non-executives. Do your best in recognises which table it is!
- Only print 3 tables max. before you print the first table be sure it is the one i want, tehre should be numbers and names and a few columns at least
- finally the table that you did choose, tell me the page number in the pdf of that."""
        },
        {
            "role": "user",
            "content": "raw pdf text; extract and format tables:" + text
        }
    ]

    try:
        api_response = openai.ChatCompletion.create(model=my_ai_model, messages=messages, stream=True)
        reply = ""
        for delta in api_response:
            if not delta['choices'][0]['finish_reason']:
                word = delta['choices'][0]['delta']['content']
                reply += word
                print(word, end="")
        return reply
    except Exception as err:
        error_message = f"API Error page {page_no}: {str(err)}"
        print(error_message)

# Create a list to store AI-processed text
ai_processed_text_list = []

# Open the PDF file in binary mode
with open(pdf_file, 'rb') as pdf_file:
    pdf_reader = PyPDF2.PdfReader(pdf_file)

    # Iterate only through the likely pages and extract text
    for page_num in likely_pages:
        page = pdf_reader.pages[page_num - 1]
        page_text = page.extract_text()

        if len(page_text) > 20:
            # Process with AI
            ai_processed_text = aiprocessor(page_num, page_text)

            # Append the AI-processed text to the list
            ai_processed_text_list.append(ai_processed_text)

# Combine all AI-processed text into a single string
combined_text = "\n".join(ai_processed_text_list)

# Define the output text file name (same root name as the PDF)
output_text_file = pdf_file.name + "-AI-all.txt"

# Save the combined text into a .txt file
with open(output_text_file, 'w', encoding='utf-8') as output_file:
    output_file.write(combined_text)

print(f"AI-processed text saved to {output_text_file}")

Code output

..AI processing page 46
Table 6: Statutory Non-Executive Director Remuneration for FY23
Name               | Role                          | Board and Committee Fees | Superannuation | Total      
-------------------|-------------------------------|--------------------------|----------------|--------------
Greg Miles         | Chairman, Non-Executive Director | $166,666                 | -              | $166,666
Phillip Arnall     | Chairman, Non-Executive Director | $28,333                  | -              | $28,333 
Michael Barnes     | Non-Executive Director          | $99,548                  | $10,452        | $110,000 
Jane Lloyd         | Non-Executive Director          | $110,000                 | -              | $110,000 
Gerard McMahon     | Non-Executive Director          | $24,887                  | $2,613         | $27,500 
Kathy Parsons      | Non-Executive Director          | $117,647                 | $12,353        | $130,000 
Jim Sloman OAM     | Non-Executive Director          | $130,000                 | -              | $130,000 
Craig van der Laan | Non-Executive Director          | $117,647                 | $12,353        | $130,000 

The table is located on page 4 of the PDF.

..AI processing page 43
Table 4: Actual Executive Remuneration
Name             | Role               | Fixed Super Package | Total STI Awarded Following Completion of the Financial Year | Value of LTI (SELTI) that Vested Following Completion of the Measurement Period/FY21 | Amount % of TRP | Amount* % of TRP | Amount % of TRP | Total Remuneration Package (TRP)
-----------------|--------------------|---------------------|-----------------------------------------------------------|---------------------------------------------------------------------------------------|-----------------|------------------|-----------------|----------------------------
Peter Marix-Evans | CEO & Managing Director | $850,000 | 59% | $595,000 | 41% | $0 | 0% | $1,445,000
Scott Jamieson | CFO & Joint Company Secretary | $509,000 | 63% | $305,400 | 37% | $0 | 0% | $814,400

Page number: 43

..AI processing page 48
Table 1: Executive KMP
|      Name       | Tranche FY23 LTI Performance Rights | Grant Type |         Vesting Conditions         | Grant Date | Grant Number | Total Value at Grant | Value Expensed in FY23 | Max Value to be Expensed in Future Years |
|-----------------|------------------------------------|------------|----------------------------------|------------|--------------|----------------------|-----------------------|-----------------------------------------|
| Peter Marix-Evans |               228,925                |    LTI     |         Net Profit Growth        | 30/09/2022 |    340,000   |       113,333        |         226,667        |                 $                 328,467                 |
|  Scott Jamieson |               102,814                |    LTI     |         Net Profit Growth        | 30/09/2022 |    152,700   |         50,900         |            101,800          |                 $                 164,233                 |
|      Totals       |               331,739                |            |                                      |                |              |         492,700         |              164,233              |                 $                 492,700                 |

Table 2: Service Agreements for Executive KMP
|        Name         |   Position Held at Close of FY23    |       Employing Company       | Duration of Contract | From Company | From KMP | Termination Payments* |
|---------------------|-------------------------------------|--------------------------------|----------------------|--------------|-----------|----------------------|
| Peter Marix-Evans |           CEO & Managing Director          | Shape Australia Corporation Limited |        Ongoing        |   12 months   | 12 months |           -           |
|   Scott Jamieson  |           CFO & Joint Company Secretary           | Shape Australia Corporation Limited |        Ongoing        |    3 months   | 3 months  |           -           |

Please note that the first table (Table 1) shows the remuneration for the Executive KMP, and it is located on Page 1 of the PDF.

..AI processing page 44
Table 5: Statutory Executive Remuneration

|        | Peter Marix-Evans   | Scott Jamieson       |
|--------|--------------------|---------------------|
| Role   | CEO & Managing Director | CFO & Joint Company Secretary |
| Fixed PaySalary  | $825,103           | $486,571               |
| Super  | $27,500            | $27,500                |
| Other Benefits | $2,841             | $2,841                 |
| Total Fixed Pay | $855,444           | $516,912               |
| Variable RemunerationCash STVR * | $389,015           | $188,457               |
| LTI ** | $61,040            | $27,150                |
| Total Remuneration Package (TRP) | $1,305,499         | $732,519               |
| Variable Remuneration as a % of TRP | 34%              | 29%                    |
| Change in Accrued Leave | $11,006            | $59,741                |

Note:
* The STVR value reported in this table is the STVR that was paid during the reporting period, being the award earned during the previous period. Incentive outcomes for the reporting and previous period are outlined elsewhere in this report.
** The LTVR value reported in this table is the amortised accounting charge of all grants that have not lapsed or vested as at the start of the reporting period.

The table above shows the Statutory Executive Remuneration for Peter Marix-Evans and Scott Jamieson. The page number in the PDF where this table is found is not provided.

..AI processing page 95
Table 1:
+--------------------------------+--------------+--------------+
|             Title              |    2023      |    2022      |
+--------------------------------+--------------+--------------+
| Short-term employee benefits   |  2,712,546   |  3,019,173   |
| Post-employment benefits       |    92,771    |   109,343    |
| Long-term benefits             |    47,756    |    22,147    |
| Share-based payments           |    88,190    |  (247,207)   |
| Total Compensation             |  2,941,263   |  2,903,456   |
+--------------------------------+--------------+--------------+

Table 2:
+------------------+-------------------------------------------------+
|      Plan        |                    Description                    |
+------------------+-------------------------------------------------+
|       SELTI      | Employee incentive plan for senior management     |
|                  | and other eligible employees                       |
|       ESAP       | Employee incentive plan for all Group employees    |
+------------------+-------------------------------------------------+

Table 3:
+------------------------------------------------------------------+
|                     Company Details                              |
+------------------------------------------------------------------+
|   Company Name            |   SHAPE Australia Corporation Limited       |
|   ABN                     |   14 654 729 352                           |
|   ASX                     |   SHA                                      |
+------------------------------------------------------------------+

The table that shows the compensation for key management personnel is Table 1. The page number of this table is 27.AI-processed text saved to SHA_full.pdf-AI-all.txt

You can see it finds the 2 tables I want (reposted below). I guess now the step is for it to ONLY find these two tables. Any ideas?. Thanks

Table 5: Statutory Executive Remuneration

Peter Marix-Evans Scott Jamieson
Role CEO & Managing Director CFO & Joint Company Secretary
Fixed PaySalary $825,103 $486,571
Super $27,500 $27,500
Other Benefits $2,841 $2,841
Total Fixed Pay $855,444 $516,912
Variable RemunerationCash STVR * $389,015 $188,457
LTI ** $61,040 $27,150
Total Remuneration Package (TRP) $1,305,499 $732,519
Variable Remuneration as a % of TRP 34% 29%
Change in Accrued Leave $11,006 $59,741

…AI processing page 46
Table 6: Statutory Non-Executive Director Remuneration for FY23

Name Role Board and Committee Fees Superannuation Total
Greg Miles Chairman, Non-Executive Director $166,666 - $166,666
Phillip Arnall Chairman, Non-Executive Director $28,333 - $28,333
Michael Barnes Non-Executive Director $99,548 $10,452 $110,000
Jane Lloyd Non-Executive Director $110,000 - $110,000
Gerard McMahon Non-Executive Director $24,887 $2,613 $27,500
Kathy Parsons Non-Executive Director $117,647 $12,353 $130,000
Jim Sloman OAM Non-Executive Director $130,000 - $130,000
Craig van der Laan Non-Executive Director $117,647 $12,353 $130,000

I was thinking, possibly even if I had the perfect prompt I could achieve this. Have been trying for the last day though and haven’t come up with anything so far

This seems like a particular use where I can’t see the investment of time that would get you a payoff by AI automation. PDFs don’t necessarily even have parseable text, they can be scans of documents, even using technology to replace imagery into pieces with layer extractions. Or just have a password on them.

You could start with a preliminary page extraction and ask an AI “does this page discuss executive and employee salaries with money figures?”, until you get a hit on just pages that do.

Many of those problems can be solved by using ghostscript to create a tiff, and then use tesseract to create a hocr file (and from there you can find tables by yourself using spatial comparisons).

hocr2yaml also works great to lower the tokens.

so pdf → gs to tiff → tesseract to hocr → hocr2yaml to yaml → hey bot here is some pdf content [yaml]

pretty sure there are also some hocr to html libs. Using them you might be able to parse the table from HTML.

Thanks for the suggestion, those are all new things to me. Could you help me get started with that approach? Thanks
Also why do you think this approach will be superior?

I think ghostscript is too old to even know about modern security stuff in pdf or just ignores it.

Also hocr keeps spatial informations.

When I used AWS textract I had to combine two different model outputs to get a good result for invoice processing using spatial comparison.

Sometimes a field has one row, sometimes two or more. Using spatial comparison helps alot there.

I assume GPT-4 has some sort of spatial knowledge.

So when there are multiple thematically matching phrases under eachother I guess it understands it quiet well that it could be a table.

1 Like

I am a data scientist at Sensible and we have streamlined extracting tabled from PDFs using LLMs. app.sensible.so

1 Like

Hope you have solved this. My two cents on this is that you can use llm to determine the likely_pages. This way it might be more customized and accurate

Like slice the pdf in single pages and look for tabelaric data… but sometimes a table might go over multiple pages. So something on top of that that might need to agregate them would be good.

I would put all into a database and work with gis functions on the bounding boxes

How has your progress been on this? I’ve found PyMuPDF/Fitz to be very good with tables. Since your tables always have similar names, you could combine Fitz with regex to pinpoint them within your document.