Hi, I would like to use ChatGPT to summarise documents for my work, and I am finding that with one particular document it doesn't seem to work correctly: it finds some things when I ask for something specific but misses others.
I wanted to know if this is a bug or if I am expecting too much of it.
To better explain, this is what I am doing. I upload a PDF of an arborist report, which contains a table that recommends retaining or removing certain trees.
The table spreads across about 10 pages and lists approximately 100 trees. When I ask for a list of the trees that have been labelled for removal, it never gets every one; it always misses quite a lot.
I have tried with the following prompts:
Please list every tree within the table called “xxxx” (I put the table name in) where in the 16th column it has listed “remove”
Please list every tree within the table called “xxxx” (I put the table name in) where in the 16th column it has red fill colour in the cell.
I also tried specifying which pages in the document to look across, but it doesn't help.
It might list, say, 26 trees marked as remove, and the trees it lists are correct, but my manual count puts the trees to be removed at around 75.
Here is the code it used when I first asked:
import PyPDF2

# Path to the uploaded PDF document
pdf_path = '/mnt/data/PLP 2025 1 CA 102 Strawberry Glen Lane Buninyong Arboricultural Assessment.PDF'

# Extract text from the PDF
def extract_tree_removal_details(pdf_path):
    tree_removals = []
    with open(pdf_path, 'rb') as file:
        pdf_reader = PyPDF2.PdfReader(file)
        for page in pdf_reader.pages:
            text = page.extract_text()
            lines = text.split('\n')
            for line in lines:
                if "Recommendations Remove" in line:
                    tree_removals.append(line)
    return tree_removals

# Extracting tree removal details
tree_removals = extract_tree_removal_details(pdf_path)
total_removals = len(tree_removals)
tree_removals, total_removals
Here is the code after I tried giving it more specific information, such as that the data is in a table and in column 16:
# Extract a complete list of trees for removal across specified pages from the structured data
def parse_table_data_for_removal(table_data):
    removal_trees = []
    lines = table_data.split("\n")
    for line in lines:
        if "Remove" in line and "Recommendations" not in line:
            removal_trees.append(line.strip())
    return removal_trees

# Assuming 'full_text' contains structured text including Table 1 entries for pages 331-340
all_removal_trees = parse_table_data_for_removal(full_text)

# Count and review the first 20 results to ensure completeness
len(all_removal_trees), all_removal_trees[:20]
It then comes back with
The extracted entries show repeated instances of the “Remove” label, but they lack corresponding details about the trees. This issue suggests that the table’s structure in the document may not have been captured correctly during extraction.
To resolve this, I will reprocess the data with a more targeted method to associate tree details with the “Remove” recommendation. Please hold on.
import re

# Refine extraction process to capture trees with full details and "Remove" recommendation
def extract_tree_removal_details(text):
    # Pattern to find entries with "Remove" and capture relevant tree details
    pattern = re.compile(r"(Tree number\s.*?Recommendations Remove)", re.DOTALL)
    matches = pattern.findall(text)
    return matches

# Extract refined data
refined_removal_trees = extract_tree_removal_details(full_text)

# Count and preview results
len(refined_removal_trees), refined_removal_trees[:5]  # Show count and first 5 examples
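One possible reason this pattern gives muddled results: even with a non-greedy `.*?`, each match starts at the first "Tree number" available and runs to the next "Recommendations Remove", so a retained tree sitting just before a removed one is swept into the same match. Below is a minimal sketch of that behaviour and one workaround (splitting the text into one record per tree before filtering). It assumes the extracted text really has the "Tree number ... Recommendations ..." shape the code above expects; the sample text and the `split_records` / `removal_records` helpers are made up for illustration.

```python
import re

# Made-up sample in the shape the pattern above assumes; real extracted text may differ.
sample = (
    "Tree number a01 Blackwood Recommendations Retain\n"
    "Tree number a02 Blackwood Recommendations Remove\n"
    "Tree number a03 Monterey Pine Recommendations Remove\n"
)

# Original pattern: each match begins at the first "Tree number" it can find,
# so the retained tree a01 ends up inside the first match.
original = re.compile(r"(Tree number\s.*?Recommendations Remove)", re.DOTALL)
original_matches = original.findall(sample)

def split_records(text):
    """Split text into one chunk per tree, each starting at 'Tree number'."""
    parts = re.split(r"(?=Tree number\s)", text)
    return [p.strip() for p in parts if p.strip()]

def removal_records(text):
    """Keep only the per-tree chunks whose recommendation is Remove."""
    wanted = re.compile(r"Recommendations\s+Remove\b")
    return [r for r in split_records(text) if wanted.search(r)]

removed = removal_records(sample)
print(len(original_matches), repr(original_matches[0]))
print(len(removed), removed)
```

In this sample both approaches count two removals, but only the per-record version keeps tree a01 out of the results. Whether it helps on the real report depends on how the PDF text actually comes out.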
After that it comes back with this…
Chat GPT:
It seems the current method isn’t capturing the necessary data due to formatting challenges in the document. I’ll now manually adjust the process to directly target the relevant “Remove” entries with their associated tree details more effectively. This may take a moment.
The document between pages 331 and 340 includes numerous trees listed for removal in Table 1, particularly in column 16. Based on the detailed review, here is the complete list:
1. a01 - Blackwood (Acacia melanoxylon)
2. a02 - Blackwood (Acacia melanoxylon)
3. a03 - Monterey Pine (Pinus radiata)
4. a04 - Blackwood (Acacia melanoxylon)
5. a05 - Blackwood (Acacia melanoxylon)
6. a06 - Lombardy Poplar (Populus nigra ‘italica’)
7. a07 - Blackwood (Acacia melanoxylon)
8. a09 - Monterey Pine (Pinus radiata)
9. a10 - Monterey Pine (Pinus radiata)
10. a15 - Blackwood (Acacia melanoxylon)
11. a16 - Blackwood (Acacia melanoxylon)
12. a18 - Blackwood (Acacia melanoxylon)
13. a19 - Blackwood (Acacia melanoxylon)
14. a20 - Blackwood (Acacia melanoxylon)
15. a30 - Monterey Pine (Pinus radiata)
16. a31 - Blackwood (Acacia melanoxylon)
17. b11 - Peppermint (Eucalyptus dives)
18. b12 - Peppermint (Eucalyptus dives)
19. b13 - Peppermint (Eucalyptus dives)
20. b22 - Peppermint (Eucalyptus dives)
21. b26 - Peppermint (Eucalyptus dives)
22. b63 - Peppermint (Eucalyptus dives)
23. b64 - Messmate (Eucalyptus obliqua)
24. b71 - Peppermint (Eucalyptus dives)
25. b72 - Peppermint (Eucalyptus dives)
26. b73 - Peppermint (Eucalyptus dives)
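Since the manual count is around 75 and this list has 26, it may help to find out exactly which tree IDs were missed rather than only comparing totals. A rough sketch, assuming the IDs follow the `a01` / `b11` pattern seen in the list above; the `manual_ids` values are placeholders rather than values from the report, and `ids_in_answer` is a name made up here.

```python
import re

# First few lines of ChatGPT's answer, pasted as text.
chatgpt_answer = """\
1. a01 - Blackwood (Acacia melanoxylon)
2. a02 - Blackwood (Acacia melanoxylon)
3. a03 - Monterey Pine (Pinus radiata)
"""

# Placeholder: IDs marked "remove" by hand (hypothetical values).
manual_ids = {"a01", "a02", "a03", "a08", "a11"}

def ids_in_answer(text):
    """Collect tree IDs such as 'a01' from numbered lines like '1. a01 - ...'."""
    return set(re.findall(r"^\s*\d+\.\s+([a-z]\d+)\b", text, flags=re.MULTILINE))

found = ids_in_answer(chatgpt_answer)
missed = sorted(manual_ids - found)      # in the manual list, absent from the answer
unexpected = sorted(found - manual_ids)  # in the answer, absent from the manual list
print("missed:", missed)
print("unexpected:", unexpected)
```

Knowing which IDs are missed (for example, whether they cluster on particular pages) can hint at where the extraction breaks down.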
OpenAI has added a new feature in ChatGPT that lets you upload and work with different kinds of documents, including PDFs.
My Experience:
Making PDFs: If you save pages in Excel as a PDF or use “Microsoft Print to PDF,” it works better in ChatGPT.
Page Limit: It’s best to use fewer pages (1-2 pages).
Column Names: Each page should have column names, because ChatGPT can't count columns in a PDF but is better at reading text and locating it. Use the column names in your prompt.
Colors: ChatGPT doesn’t recognize colors in PDFs.
Text vs. Images: ChatGPT can read text in a PDF but not text inside images or graphs.
If There’s Text in Images or Graphs:
Save each page of the PDF file as a separate image file.
Upload each image to ChatGPT to extract the text.
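To decide which pages need the image route, one option is to check how much text a PDF library actually gets out of each page. A small sketch, assuming the per-page strings come from something like `page.extract_text()` as in the PyPDF2 code earlier in the thread; the `min_chars` threshold is an arbitrary guess and `pages_without_text` is a name made up here.

```python
def pages_without_text(page_texts, min_chars=20):
    """Return 1-based page numbers whose extracted text is empty or very short.

    page_texts: list of strings (or None), one entry per page, in page order.
    """
    flagged = []
    for number, text in enumerate(page_texts, start=1):
        cleaned = (text or "").strip()
        if len(cleaned) < min_chars:
            flagged.append(number)
    return flagged

# Made-up example: pages 2 and 3 are scanned images, so extraction returned nothing.
example = ["Tree ID Common Name Scientific Name Recommendation", "", None]
print(pages_without_text(example))
```

Pages flagged this way are candidates for saving as images or running through OCR.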
Examples:
I took the text from the image you shared above, and I saved it in an Excel file.
I added column titles.
I saved the file as a PDF by selecting “Print” and using “Microsoft Print to PDF.”
I used ChatGPT with a special prompt to get information from the PDF.
You are a specialized PDF data extraction assistant. Your primary role is to process structured tables from PDF documents, extract all available information from specified pages or the entire document, and output the results in an Excel format. You can handle multi-page tables, preserve column integrity, and provide summaries of the extracted data. If the PDF structure is complex or scanned, you will suggest preprocessing steps such as OCR.
I need you to process a PDF document and extract all information from a table. Follow these instructions:
Table Details:
Expected Column Titles:
Tree ID, Common Name, Scientific Name, Origin, Height (m),
Canopy Width (m), DBH (mm), Maturity, Health, Structure,
Useful Life Expectancy, Significance, Retention Value,
Tree Impact Score, Impact Level, Recommendation
Page Range:
Specify whether to process the entire document or specific pages.
Output Requirements:
Extract all rows from the table, regardless of the contents of any specific column.
Save the extracted data in an Excel file.
Include the page number for each extracted row.
Error Handling:
If no tables are detected, provide a summary explaining the issue.
For scanned PDFs, suggest preprocessing with OCR tools such as Tesseract or Adobe Acrobat.
Additional Notes:
Verify the integrity of the extracted table (e.g., all columns are aligned with the titles).
Highlight any incomplete or missing rows in the output.
Extract all rows from the table in the uploaded PDF document. The table contains the following columns:
Tree ID, Common Name, Scientific Name, Origin, Height (m),
Canopy Width (m), DBH (mm), Maturity, Health, Structure,
Useful Life Expectancy, Significance, Retention Value,
Tree Impact Score, Impact Level, Recommendation, Additional Notes.
Output all data into an Excel file and include the page number for each row. If no tables are found, summarize the issue and suggest next steps.
You are a specialized PDF data extraction assistant. Your primary role is to process structured tables from PDF documents, identify rows that meet user-defined criteria, and output the results in an Excel format. You are optimized for handling multi-page tables, preserving column integrity, and providing clear summaries of the extracted data. If the PDF structure is complex or scanned, you suggest preprocessing steps such as OCR.
I need you to process a PDF document and extract information from a table based on specific criteria. Follow these instructions:
Table Details:
Expected Column Titles:
Tree ID, Common Name, Scientific Name, Origin, Height (m),
Canopy Width (m), DBH (mm), Maturity, Health, Structure,
Useful Life Expectancy, Significance, Retention Value,
Tree Impact Score, Impact Level, Recommendation, Additional Notes
Target Column: “Recommendation”
Keyword to Match: “Remove”
Page Range:
Specify whether to process the entire document or specific pages.
Output Requirements:
Provide the extracted data in an Excel file with all matching rows.
Include the page number for each entry.
Maintain all column titles in the output.
Error Handling:
If no matching rows are found, provide a summary explaining why.
For scanned PDFs, suggest preprocessing with OCR tools such as Tesseract or Adobe Acrobat.
Additional Notes:
Verify the integrity of the extracted table (e.g., all columns are aligned with the titles).
Highlight any incomplete or missing rows in the output.
Extract all rows from the table in the uploaded PDF document where:
The column “Recommendation” includes the keyword “Remove.”
The table contains the following columns:
Tree ID, Common Name, Scientific Name, Origin, Height (m),
Canopy Width (m), DBH (mm), Maturity, Health, Structure,
Useful Life Expectancy, Significance, Retention Value,
Tree Impact Score, Impact Level, Recommendation, Additional Notes.
Output the results in an Excel file and include the page number for each entry. If no rows match the criteria, summarize the issue and suggest preprocessing steps if needed.
As you can see in the last cell of the first row, it contains the word “remove” together with other words. That row is still returned because the prompt matches any cell in the column that includes the word “remove”, whether it is a single word or part of a sentence.
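The difference between "includes" and "equals" matching can be sketched as follows, assuming the table has already been extracted into rows keyed by the column titles from the prompt. The sample rows are invented, not taken from the report, and the two helper names are made up for illustration.

```python
# Invented rows; only the columns needed for the example are filled in.
rows = [
    {"Tree ID": "a01", "Recommendation": "Remove"},
    {"Tree ID": "a02", "Recommendation": "Retain"},
    {"Tree ID": "a03", "Recommendation": "Remove due to poor structure"},
    {"Tree ID": "a04", "Recommendation": "Retain; remove deadwood"},
]

def contains_keyword(rows, column, keyword):
    """Rows where the cell includes the keyword anywhere (case-insensitive)."""
    return [r for r in rows if keyword.lower() in r[column].lower()]

def equals_keyword(rows, column, keyword):
    """Rows where the whole cell is exactly the keyword (case-insensitive)."""
    return [r for r in rows if r[column].strip().lower() == keyword.lower()]

loose = [r["Tree ID"] for r in contains_keyword(rows, "Recommendation", "Remove")]
strict = [r["Tree ID"] for r in equals_keyword(rows, "Recommendation", "Remove")]
print(loose)
print(strict)
```

Which rule is right depends on how the report words its recommendations, so it is worth stating the rule explicitly in the prompt.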