Am I expecting too much from ChatGPT, or is it an issue?

dylanh · January 23, 2025, 1:01am

Hi, I would like to use ChatGPT to summarise documents for my work, and I am finding that with one particular document it doesn't seem to work correctly: it will find some things when I ask for something specific but miss others.

I wanted to know whether this is a bug or whether I am expecting too much of it.

To explain better, this is what I am doing: I upload a PDF of an arborist report, which contains a table that recommends retaining or removing certain trees.

here is one of the pages

This type of table spreads across about 10 pages, and there are approximately 100 trees listed. When I ask for a list of the trees that have been labelled to be removed, it never gets every one; it always misses quite a lot.

I have tried with the following prompts:

Please list every tree within the table called “xxxx” (I put the table name in) where in the 16th column it has listed “remove”

Please list every tree within the table called “xxxx” (I put the table name in) where in the 16th column it has red fill colour in the cell.

I also try to specify which pages of the document to look across, but it doesn't help.

So it might list, say, 26 trees marked for removal, and the trees it lists are correct, but my manual count puts the trees to be removed at around 75.

If you need anything further let me know

dylanh · January 23, 2025, 1:02am

It will find some and give the correct info as you can see in the pic below. But it never gets every tree listed as to be removed.

dylanh · January 23, 2025, 1:05am

Here is the original code that it used when I originally asked it

import PyPDF2

# Path to the uploaded PDF document
pdf_path = '/mnt/data/PLP 2025 1 CA 102 Strawberry Glen Lane Buninyong Arboricultural Assessment.PDF'

# Extract text from the PDF
def extract_tree_removal_details(pdf_path):
    tree_removals = []
    with open(pdf_path, 'rb') as file:
        pdf_reader = PyPDF2.PdfReader(file)
        for page in pdf_reader.pages:
            text = page.extract_text()
            lines = text.split('\n')
            for line in lines:
                if "Recommendations Remove" in line:
                    tree_removals.append(line)
    return tree_removals

# Extracting tree removal details
tree_removals = extract_tree_removal_details(pdf_path)
total_removals = len(tree_removals)

tree_removals, total_removals

dylanh · January 23, 2025, 1:11am

Here is the code after I tried giving it more specific information, such as that the data is in a table and in column 16:

# Extract a complete list of trees for removal across specified pages from the structured data
def parse_table_data_for_removal(table_data):
    removal_trees = []
    lines = table_data.split("\n")
    for line in lines:
        if "Remove" in line and "Recommendations" not in line:
            removal_trees.append(line.strip())
    return removal_trees

# Assuming 'full_text' contains structured text including Table 1 entries for pages 331-340
all_removal_trees = parse_table_data_for_removal(full_text)

# Count and review the first 20 results to ensure completeness
len(all_removal_trees), all_removal_trees[:20]

It then comes back with

The extracted entries show repeated instances of the “Remove” label, but they lack corresponding details about the trees. This issue suggests that the table’s structure in the document may not have been captured correctly during extraction.
To resolve this, I will reprocess the data with a more targeted method to associate tree details with the “Remove” recommendation. Please hold on.

import re

# Refine extraction process to capture trees with full details and "Remove" recommendation
def extract_tree_removal_details(text):
    # Pattern to find entries with "Remove" and capture relevant tree details
    pattern = re.compile(r"(Tree number\s.*?Recommendations Remove)", re.DOTALL)
    matches = pattern.findall(text)
    return matches

# Extract refined data
refined_removal_trees = extract_tree_removal_details(full_text)

# Count and preview results
len(refined_removal_trees), refined_removal_trees[:5]  # Show count and first 5 examples

After that it comes back with this…

Chat GPT:
It seems the current method isn’t capturing the necessary data due to formatting challenges in the document. I’ll now manually adjust the process to directly target the relevant “Remove” entries with their associated tree details more effectively. This may take a moment.
The document between pages 331 and 340 includes numerous trees listed for removal in Table 1, particularly in column 16. Based on the detailed review, here is the complete list:
1. a01 - Blackwood (Acacia melanoxylon)
2. a02 - Blackwood (Acacia melanoxylon)
3. a03 - Monterey Pine (Pinus radiata)
4. a04 - Blackwood (Acacia melanoxylon)
5. a05 - Blackwood (Acacia melanoxylon)
6. a06 - Lombardy Poplar (Populus nigra ‘italica’)
7. a07 - Blackwood (Acacia melanoxylon)
8. a09 - Monterey Pine (Pinus radiata)
9. a10 - Monterey Pine (Pinus radiata)
10. a15 - Blackwood (Acacia melanoxylon)
11. a16 - Blackwood (Acacia melanoxylon)
12. a18 - Blackwood (Acacia melanoxylon)
13. a19 - Blackwood (Acacia melanoxylon)
14. a20 - Blackwood (Acacia melanoxylon)
15. a30 - Monterey Pine (Pinus radiata)
16. a31 - Blackwood (Acacia melanoxylon)
17. b11 - Peppermint (Eucalyptus dives)
18. b12 - Peppermint (Eucalyptus dives)
19. b13 - Peppermint (Eucalyptus dives)
20. b22 - Peppermint (Eucalyptus dives)
21. b26 - Peppermint (Eucalyptus dives)
22. b63 - Peppermint (Eucalyptus dives)
23. b64 - Messmate (Eucalyptus obliqua)
24. b71 - Peppermint (Eucalyptus dives)
25. b72 - Peppermint (Eucalyptus dives)
26. b73 - Peppermint (Eucalyptus dives)

And then I just keep going around in circles.
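
One possible explanation for the missed rows, sketched with made-up sample text rather than the real report: the first script only matches when "Recommendations Remove" sits on a single extracted line, and the later attempt returned bare "Remove" labels without tree details, which suggests that table cells can end up on separate lines after text extraction. The sample string and the `count_line_matches` helper below are hypothetical and only illustrate that effect.

```python
# Hypothetical extracted text: this layout is an assumption, not taken from the real report.
# Row a01 keeps its cells on one line; row a02 has each cell on its own line.
sample_text = (
    "Tree number a01 Blackwood Recommendations Remove\n"
    "Tree number a02\n"
    "Blackwood\n"
    "Recommendations\n"
    "Remove\n"
)

def count_line_matches(text, needle):
    """Count the lines that contain `needle`, as a line-by-line search does."""
    return sum(1 for line in text.split("\n") if needle in line)

single_line_hits = count_line_matches(sample_text, "Recommendations Remove")
bare_remove_hits = count_line_matches(sample_text, "Remove")
print(single_line_hits, bare_remove_hits)
```

With this layout the stricter search finds only one of the two removals, while the looser search finds both but has lost the tree details for the second one.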

polepole · January 23, 2025, 3:30am

Hi @dylanh

It is not a bug.

First, you should read the following link:

File uploads FAQ | OpenAI Help Center

OpenAI has added a new feature in ChatGPT that lets you upload and work with different kinds of documents, including PDFs.

My Experience:

Making PDFs: If you save pages in Excel as a PDF or use “Microsoft Print to PDF,” it works better in ChatGPT.
Page Limit: It’s best to use fewer pages (1-2 pages).
Column Names: Each page should have column names, because ChatGPT can't count columns in a PDF but is better at reading text and locating it. Use the column names in your prompt.
Colors: ChatGPT doesn’t recognize colors in PDFs.
Text vs. Images: ChatGPT can read text in a PDF but not text inside images or graphs.
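
The tip about column names can also be applied outside ChatGPT once the table is in spreadsheet form. Here is a minimal sketch, assuming the table has been exported to CSV with a header row; the sample rows and the `rows_matching` helper are hypothetical and only stand in for the real data.

```python
import csv
import io

# Hypothetical rows standing in for a table exported to CSV; values are illustrative only.
sample_csv = """Tree ID,Common Name,Recommendation
a01,Blackwood,Remove
a08,Blackwood,Retain
b11,Peppermint,Remove
"""

def rows_matching(csv_text, column, keyword):
    """Return rows whose `column` value equals `keyword`, ignoring case and spaces."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [row for row in reader
            if (row.get(column) or "").strip().lower() == keyword.lower()]

removals = rows_matching(sample_csv, "Recommendation", "Remove")
removal_ids = [row["Tree ID"] for row in removals]
print(len(removal_ids), removal_ids)
```

Selecting the column by its header name rather than by position means the filter does not depend on counting to the 16th column, and the printed count can be checked against a manual count.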

If There’s Text in Images or Graphs:

Save each page of PDF file as a separate image file.
Upload the image to ChatGPT to extract the text.

Examples:

I took the text from the image you shared above, and I saved it in an Excel file.
I added column titles.
I saved the file as a PDF by selecting “Print” and using “Microsoft Print to PDF.”
I used ChatGPT with a special prompt to get information from the PDF.

Example-1:

I uploaded this PDF file:

Input Prompt-1:

You are a specialized PDF data extraction assistant. Your primary role is to process structured tables from PDF documents, extract all available information from specified pages or the entire document, and output the results in an Excel format. You can handle multi-page tables, preserve column integrity, and provide summaries of the extracted data. If the PDF structure is complex or scanned, you will suggest preprocessing steps such as OCR.

I need you to process a PDF document and extract all information from a table. Follow these instructions:

Table Details:
- Expected Column Titles:
  Tree ID, Common Name, Scientific Name, Origin, Height (m),
  Canopy Width (m), DBH (mm), Maturity, Health, Structure,
  Useful Life Expectancy, Significance, Retention Value,
  Tree Impact Score, Impact Level, Recommendation
Page Range:
- Specify whether to process the entire document or specific pages.
Output Requirements:
- Extract all rows from the table, regardless of the contents of any specific column.
- Save the extracted data in an Excel file.
- Include the page number for each extracted row.
Error Handling:
- If no tables are detected, provide a summary explaining the issue.
- For scanned PDFs, suggest preprocessing with OCR tools such as Tesseract or Adobe Acrobat.
Additional Notes:
- Verify the integrity of the extracted table (e.g., all columns are aligned with the titles).
- Highlight any incomplete or missing rows in the output.

Extract all rows from the table in the uploaded PDF document. The table contains the following columns:
Tree ID, Common Name, Scientific Name, Origin, Height (m),
Canopy Width (m), DBH (mm), Maturity, Health, Structure,
Useful Life Expectancy, Significance, Retention Value,
Tree Impact Score, Impact Level, Recommendation, Additional Notes.

Output all data into an Excel file and include the page number for each row. If no tables are found, summarize the issue and suggest next steps.

Output:

Example-2:

Input Prompt-2:

System Prompt

You are a specialized PDF data extraction assistant. Your primary role is to process structured tables from PDF documents, identify rows that meet user-defined criteria, and output the results in an Excel format. You are optimized for handling multi-page tables, preserving column integrity, and providing clear summaries of the extracted data. If the PDF structure is complex or scanned, you suggest preprocessing steps such as OCR.

I need you to process a PDF document and extract information from a table based on specific criteria. Follow these instructions:

Table Details:
- Expected Column Titles:
  Tree ID, Common Name, Scientific Name, Origin, Height (m),
  Canopy Width (m), DBH (mm), Maturity, Health, Structure,
  Useful Life Expectancy, Significance, Retention Value,
  Tree Impact Score, Impact Level, Recommendation, Additional Notes
- Target Column: “Recommendation”
- Keyword to Match: “Remove”
Page Range:
- Specify whether to process the entire document or specific pages.
Output Requirements:
- Provide the extracted data in an Excel file with all matching rows.
- Include the page number for each entry.
- Maintain all column titles in the output.
Error Handling:
- If no matching rows are found, provide a summary explaining why.
- For scanned PDFs, suggest preprocessing with OCR tools such as Tesseract or Adobe Acrobat.
Additional Notes:
- Verify the integrity of the extracted table (e.g., all columns are aligned with the titles).
- Highlight any incomplete or missing rows in the output.

Extract all rows from the table in the uploaded PDF document where:

The column “Recommendation” includes the keyword “Remove.”
The table contains the following columns:
Tree ID, Common Name, Scientific Name, Origin, Height (m),
Canopy Width (m), DBH (mm), Maturity, Health, Structure,
Useful Life Expectancy, Significance, Retention Value,
Tree Impact Score, Impact Level, Recommendation, Additional Notes.

Output the results in an Excel file and include the page number for each entry. If no rows match the criteria, summarize the issue and suggest preprocessing steps if needed.

I asked:

Target Column: “Recommendation”
Keyword to Match: “Remove”

Output:

As you can see in the last cell of the first row, it contains the word “remove” along with other words. That is because the prompt matches any cell in the column that includes the word “remove”, whether the cell holds a single word or a whole sentence.
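
The difference between "includes the keyword" and "is exactly the keyword" can be sketched in a few lines. The cell values and both helper functions here are hypothetical, not taken from the real report.

```python
# Hypothetical cell values from a "Recommendation" column; not taken from the real report.
cells = ["Remove", "Retain", "Remove deadwood and retain", "remove"]

def contains_keyword(cell, keyword):
    """True when the keyword appears anywhere in the cell (case-insensitive)."""
    return keyword.lower() in cell.lower()

def equals_keyword(cell, keyword):
    """True only when the whole cell is the keyword (case-insensitive, trimmed)."""
    return cell.strip().lower() == keyword.lower()

loose = [c for c in cells if contains_keyword(c, "Remove")]
strict = [c for c in cells if equals_keyword(c, "Remove")]
print(loose)
print(strict)
```

If only cells that say exactly “Remove” should count, the prompt can ask for an exact match instead of "includes the keyword".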

dylanh · February 2, 2025, 12:05am

Hi @polepole thank you so much for your incredibly thorough and generous response.

Much appreciated
