Okay I have made progress, but still looking for help.
This is my most recent code:
import PyPDF2
import re
import openai
import os
# Function to count numbers in a string
def count_numbers(text):
return len(re.findall(r'\b\d+\b', text))
# Find likely pages containing the table
likely_pages = []
with open('SHA_full.pdf', 'rb') as file:
pdf_reader = PyPDF2.PdfReader(file)
num_pages = len(pdf_reader.pages)
for i in range(num_pages):
page = pdf_reader.pages[i]
text = page.extract_text()
if any(keyword.lower() in text.lower() for keyword in ['key management personnel', 'KMP', 'salary']):
num_count = count_numbers(text)
if num_count > 30 and '$' in text:
likely_pages.append((i + 1, num_count))
# Sort likely_pages by the number count and limit to top 5
likely_pages = sorted(likely_pages, key=lambda x: x[1], reverse=True)[:5]
likely_pages = [x[0] for x in likely_pages]
# Initialize OpenAI API
openai.api_key = "sk-XXX" # Replace with your actual OpenAI API key
my_ai_model = "gpt-3.5-turbo"
pdf_file = "SHA_full.pdf" # Replace with your actual PDF file
# Function to process the text through AI
def aiprocessor(page_no, text):
print(f"\n\n..AI processing page {page_no}")
messages = [
{
"role": "system",
"content": """You are a PDF table extractor, a backend processor.
- User input is messy raw text extracted from a PDF page by PyPDF2.
- Do not output any body text, we are only interested in tables.
- The goal is to identify tabular data, and reproduce it cleanly a table.
- Look for the table that shows the remuneration for the executives and the non-executives. Do your best in recognises which table it is!
- Only print 3 tables max. before you print the first table be sure it is the one i want, tehre should be numbers and names and a few columns at least
- finally the table that you did choose, tell me the page number in the pdf of that."""
},
{
"role": "user",
"content": "raw pdf text; extract and format tables:" + text
}
]
try:
api_response = openai.ChatCompletion.create(model=my_ai_model, messages=messages, stream=True)
reply = ""
for delta in api_response:
if not delta['choices'][0]['finish_reason']:
word = delta['choices'][0]['delta']['content']
reply += word
print(word, end="")
return reply
except Exception as err:
error_message = f"API Error page {page_no}: {str(err)}"
print(error_message)
# Create a list to store AI-processed text
ai_processed_text_list = []
# Open the PDF file in binary mode
with open(pdf_file, 'rb') as pdf_file:
pdf_reader = PyPDF2.PdfReader(pdf_file)
# Iterate only through the likely pages and extract text
for page_num in likely_pages:
page = pdf_reader.pages[page_num - 1]
page_text = page.extract_text()
if len(page_text) > 20:
# Process with AI
ai_processed_text = aiprocessor(page_num, page_text)
# Append the AI-processed text to the list
ai_processed_text_list.append(ai_processed_text)
# Combine all AI-processed text into a single string
combined_text = "\n".join(ai_processed_text_list)
# Define the output text file name (same root name as the PDF)
output_text_file = pdf_file.name + "-AI-all.txt"
# Save the combined text into a .txt file
with open(output_text_file, 'w', encoding='utf-8') as output_file:
output_file.write(combined_text)
print(f"AI-processed text saved to {output_text_file}")
Code output
..AI processing page 46
Table 6: Statutory Non-Executive Director Remuneration for FY23
Name | Role | Board and Committee Fees | Superannuation | Total
-------------------|-------------------------------|--------------------------|----------------|--------------
Greg Miles | Chairman, Non-Executive Director | $166,666 | - | $166,666
Phillip Arnall | Chairman, Non-Executive Director | $28,333 | - | $28,333
Michael Barnes | Non-Executive Director | $99,548 | $10,452 | $110,000
Jane Lloyd | Non-Executive Director | $110,000 | - | $110,000
Gerard McMahon | Non-Executive Director | $24,887 | $2,613 | $27,500
Kathy Parsons | Non-Executive Director | $117,647 | $12,353 | $130,000
Jim Sloman OAM | Non-Executive Director | $130,000 | - | $130,000
Craig van der Laan | Non-Executive Director | $117,647 | $12,353 | $130,000
The table is located on page 4 of the PDF.
..AI processing page 43
Table 4: Actual Executive Remuneration
Name | Role | Fixed Super Package | Total STI Awarded Following Completion of the Financial Year | Value of LTI (SELTI) that Vested Following Completion of the Measurement Period/FY21 | Amount % of TRP | Amount* % of TRP | Amount % of TRP | Total Remuneration Package (TRP)
-----------------|--------------------|---------------------|-----------------------------------------------------------|---------------------------------------------------------------------------------------|-----------------|------------------|-----------------|----------------------------
Peter Marix-Evans | CEO & Managing Director | $850,000 | 59% | $595,000 | 41% | $0 | 0% | $1,445,000
Scott Jamieson | CFO & Joint Company Secretary | $509,000 | 63% | $305,400 | 37% | $0 | 0% | $814,400
Page number: 43
..AI processing page 48
Table 1: Executive KMP
| Name | Tranche FY23 LTI Performance Rights | Grant Type | Vesting Conditions | Grant Date | Grant Number | Total Value at Grant | Value Expensed in FY23 | Max Value to be Expensed in Future Years |
|-----------------|------------------------------------|------------|----------------------------------|------------|--------------|----------------------|-----------------------|-----------------------------------------|
| Peter Marix-Evans | 228,925 | LTI | Net Profit Growth | 30/09/2022 | 340,000 | 113,333 | 226,667 | $ 328,467 |
| Scott Jamieson | 102,814 | LTI | Net Profit Growth | 30/09/2022 | 152,700 | 50,900 | 101,800 | $ 164,233 |
| Totals | 331,739 | | | | | 492,700 | 164,233 | $ 492,700 |
Table 2: Service Agreements for Executive KMP
| Name | Position Held at Close of FY23 | Employing Company | Duration of Contract | From Company | From KMP | Termination Payments* |
|---------------------|-------------------------------------|--------------------------------|----------------------|--------------|-----------|----------------------|
| Peter Marix-Evans | CEO & Managing Director | Shape Australia Corporation Limited | Ongoing | 12 months | 12 months | - |
| Scott Jamieson | CFO & Joint Company Secretary | Shape Australia Corporation Limited | Ongoing | 3 months | 3 months | - |
Please note that the first table (Table 1) shows the remuneration for the Executive KMP, and it is located on Page 1 of the PDF.
..AI processing page 44
Table 5: Statutory Executive Remuneration
| | Peter Marix-Evans | Scott Jamieson |
|--------|--------------------|---------------------|
| Role | CEO & Managing Director | CFO & Joint Company Secretary |
| Fixed PaySalary | $825,103 | $486,571 |
| Super | $27,500 | $27,500 |
| Other Benefits | $2,841 | $2,841 |
| Total Fixed Pay | $855,444 | $516,912 |
| Variable RemunerationCash STVR * | $389,015 | $188,457 |
| LTI ** | $61,040 | $27,150 |
| Total Remuneration Package (TRP) | $1,305,499 | $732,519 |
| Variable Remuneration as a % of TRP | 34% | 29% |
| Change in Accrued Leave | $11,006 | $59,741 |
Note:
* The STVR value reported in this table is the STVR that was paid during the reporting period, being the award earned during the previous period. Incentive outcomes for the reporting and previous period are outlined elsewhere in this report.
** The LTVR value reported in this table is the amortised accounting charge of all grants that have not lapsed or vested as at the start of the reporting period.
The table above shows the Statutory Executive Remuneration for Peter Marix-Evans and Scott Jamieson. The page number in the PDF where this table is found is not provided.
..AI processing page 95
Table 1:
+--------------------------------+--------------+--------------+
| Title | 2023 | 2022 |
+--------------------------------+--------------+--------------+
| Short-term employee benefits | 2,712,546 | 3,019,173 |
| Post-employment benefits | 92,771 | 109,343 |
| Long-term benefits | 47,756 | 22,147 |
| Share-based payments | 88,190 | (247,207) |
| Total Compensation | 2,941,263 | 2,903,456 |
+--------------------------------+--------------+--------------+
Table 2:
+------------------+-------------------------------------------------+
| Plan | Description |
+------------------+-------------------------------------------------+
| SELTI | Employee incentive plan for senior management |
| | and other eligible employees |
| ESAP | Employee incentive plan for all Group employees |
+------------------+-------------------------------------------------+
Table 3:
+------------------------------------------------------------------+
| Company Details |
+------------------------------------------------------------------+
| Company Name | SHAPE Australia Corporation Limited |
| ABN | 14 654 729 352 |
| ASX | SHA |
+------------------------------------------------------------------+
The table that shows the compensation for key management personnel is Table 1. The page number of this table is 27.AI-processed text saved to SHA_full.pdf-AI-all.txt
You can see it finds the 2 tables I want (reposted below). I guess now the step is for it to ONLY find these two tables. Any ideas?. Thanks
Table 5: Statutory Executive Remuneration
|
Peter Marix-Evans |
Scott Jamieson |
Role |
CEO & Managing Director |
CFO & Joint Company Secretary |
Fixed PaySalary |
$825,103 |
$486,571 |
Super |
$27,500 |
$27,500 |
Other Benefits |
$2,841 |
$2,841 |
Total Fixed Pay |
$855,444 |
$516,912 |
Variable RemunerationCash STVR * |
$389,015 |
$188,457 |
LTI ** |
$61,040 |
$27,150 |
Total Remuneration Package (TRP) |
$1,305,499 |
$732,519 |
Variable Remuneration as a % of TRP |
34% |
29% |
Change in Accrued Leave |
$11,006 |
$59,741 |
…AI processing page 46
Table 6: Statutory Non-Executive Director Remuneration for FY23
Name |
Role |
Board and Committee Fees |
Superannuation |
Total |
Greg Miles |
Chairman, Non-Executive Director |
$166,666 |
- |
$166,666 |
Phillip Arnall |
Chairman, Non-Executive Director |
$28,333 |
- |
$28,333 |
Michael Barnes |
Non-Executive Director |
$99,548 |
$10,452 |
$110,000 |
Jane Lloyd |
Non-Executive Director |
$110,000 |
- |
$110,000 |
Gerard McMahon |
Non-Executive Director |
$24,887 |
$2,613 |
$27,500 |
Kathy Parsons |
Non-Executive Director |
$117,647 |
$12,353 |
$130,000 |
Jim Sloman OAM |
Non-Executive Director |
$130,000 |
- |
$130,000 |
Craig van der Laan |
Non-Executive Director |
$117,647 |
$12,353 |
$130,000 |