Read in a PDF and output a table

I am trying to write R code to read in a PDF, then use ChatGPT to make sense of the often messy text and output it as a table or data frame.
I know this is possible because if I copy and paste the text from a PDF into the ChatGPT interface and prompt it to “convert to table”, it does it perfectly.

This is currently my code:

library(pdftools)  # provides pdf_text()
library(httr)      # provides POST(), add_headers(), content()

pdf_text <- pdf_text("1pagepdffile.pdf")

pdf_text <- paste(pdf_text, collapse = " ")  # Collapse multiple pages into a single string


# API call to GPT-4
response <- POST(
  "model url",
  add_headers("Authorization" = "Bearer APIKEY", "Content-Type" = "application/json"),
  body = list(
    prompt = paste("Please format the following data as a table:", pdf_text),
    max_tokens = 500  # You can adjust this based on your needs
  ),
  encode = "json"
)

# Parse the response to get the text output
response_content <- content(response, "parsed")
response_text <- response_content$choices[[1]]$text

# Print the response or write to a file
cat(response_text)

Any help would be appreciated.

I GPTed that for ya!

When tabularizing data, what I sometimes do is say something like: please find all the people’s names, phone numbers, and addresses in this text and put them into a tabular format.

I feel better telling it what to find, but based on what you’ve said a “Go Tabularize” might be just as good.
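As a rough illustration of the "tell it what to find" approach, here is a minimal sketch that builds chat messages naming the columns explicitly. The helper name build_extraction_messages and the exact prompt wording are assumptions made for this example, not anything from the thread.

```python
def build_extraction_messages(fields, raw_text):
    """Build chat messages that name the columns to extract (hypothetical helper)."""
    field_list = ", ".join(fields)
    system = (
        "Extract the requested fields from the user's text. "
        "Reply with CSV only: one header row, then one row per record."
    )
    user = (
        f"Find every record's {field_list} in this text "
        f"and put them into a table:\n\n{raw_text}"
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]


# Example: the fields mentioned in the post above
messages = build_extraction_messages(
    ["name", "phone number", "address"], "messy text goes here"
)
```

The returned list can be passed as the messages argument of a chat completion request.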

Thanks, but I want to use GPT to make sense of the messy text I give it and format it into a clean table.

Using tools like Tika it takes about one line of code to extract text from any kind of file format. In case you didn’t know.

However yeah, I bet just for convenience it won’t be long until OpenAI has this kind of support built in, since it’s so easy to implement.

I made a multi-page AI PDF extractor/table maker for you with python. Streaming console output so you can see what it’s doing for half an hour.

The prompt works, but not as commanded (typical dumb 3.5 now). We don’t get body text ignored on table-less pages, and low-quality markdown tables are more likely. So use your prompt techniques.

on top of your Python ~3.8:
pip install openai
pip install PyPDF2

Then the code (extracted and AI processed pages are also dumped for diagnosis, before a final collection of all):

import PyPDF2
import openai
import os

# Replace with your OpenAI API key and model
openai.api_key = "sk-xxx"
my_ai_model = "gpt-3.5-turbo"

pdf_file = "2202.pdf"

def aiprocessor(page_no, text):
    print(f"\n\n..AI processing page {page_no}")
    messages = [
        {
            "role": "system",
            "content": """You are a PDF table extractor, a backend processor.
- User input is messy raw text extracted from a PDF page by PyPDF2.
- Do not output any body text, we are only interested in tables.
- The goal is to identify tabular data, and reproduce it cleanly as comma-separated table.
- Preface each output table with a line giving title and 10 word summary.
- Reproduce each separate table found in page.""",
        },
        {
            "role": "user",
            "content": "raw pdf text; extract and format tables:" + text,
        },
    ]

    api_params = {"model": my_ai_model, "messages": messages, "stream": True}
    try:
        # openai.ChatCompletion is the interface of openai library versions before 1.0
        api_response = openai.ChatCompletion.create(**api_params)
        reply = ""
        for delta in api_response:
            if not delta['choices'][0]['finish_reason']:
                # A chunk may carry no text (e.g. role only), so default to ""
                word = delta['choices'][0]['delta'].get('content', '')
                reply += word
                print(word, end="")
        return reply
    except Exception as err:
        error_message = f"API Error page {page_no}: {str(err)}"
        print(error_message)
        return ""  # keep the later join/write steps working after a failed page

# Create a list to store AI-processed text
ai_processed_text_list = []

# Open the PDF file in binary mode
with open(pdf_file, 'rb') as pdf_handle:
    pdf_reader = PyPDF2.PdfReader(pdf_handle)

    # Iterate through each page and extract text
    for page_num in range(len(pdf_reader.pages)):
        page = pdf_reader.pages[page_num]
        page_text = page.extract_text()

        if len(page_text)>20:
            # Dump unprocessed pages if desired
            page_text_file = pdf_file + "-extractedpage" + str(page_num) + ".txt"
            with open(page_text_file, 'w', encoding='utf-8') as output_file:
                output_file.write(page_text)

            # Process with AI (fall back to "" if nothing came back)
            ai_processed_text = aiprocessor(page_num, page_text) or ""

            # Dump AI pages if desired
            page_text_file = pdf_file + "-AIpage" + str(page_num) + ".txt"
            with open(page_text_file, 'w', encoding='utf-8') as output_file:
                output_file.write(ai_processed_text)

            # Append the AI-processed text to the list
            ai_processed_text_list.append(ai_processed_text)

# Combine all AI-processed text into a single string
combined_text = "\n".join(ai_processed_text_list)

# Define the output text file name (same root name as the PDF)
output_text_file = pdf_file + "-AI-all.txt"

# Save the combined text into a .txt file
with open(output_text_file, 'w', encoding='utf-8') as output_file:
    output_file.write(combined_text)

print(f"AI-processed text saved to {output_text_file}")
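The script above calls openai.ChatCompletion.create, which exists only in openai library releases before 1.0. For readers on a 1.x install, here is a hedged sketch of the same streaming request. The function name stream_tables and the shortened system prompt are assumptions for illustration, and the client is passed in so the function itself does not need an API key.

```python
SYSTEM_PROMPT = (
    "You are a PDF table extractor. Do not output body text. "
    "Reproduce each table found as comma-separated text."
)


def stream_tables(client, model, page_text):
    """Stream one chat completion and return the full reply text.

    `client` is expected to look like openai.OpenAI() from openai>=1.0.
    """
    stream = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user",
             "content": "raw pdf text; extract and format tables:" + page_text},
        ],
        stream=True,
    )
    reply = ""
    for chunk in stream:
        # Some chunks carry no choices or no text (for example the last one)
        if chunk.choices and chunk.choices[0].delta.content:
            word = chunk.choices[0].delta.content
            reply += word
            print(word, end="")
    return reply


# Real use (assumes openai>=1.0 is installed and OPENAI_API_KEY is set):
#   from openai import OpenAI
#   text = stream_tables(OpenAI(), "gpt-3.5-turbo", page_text)
```

Passing the client as a parameter also makes it easy to swap in a stub when testing without network access.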

Example pages 3 & 4:

Title: Amplifier Installation Instructions

Summary: This section provides instructions for connecting and configuring the amplifier, including connecting the power antenna lead, using RCA jacks, engaging optional functions, and configuring for bridged mono operation or two-ohm capability.

Table 1: Amplifier Connections
| Wire Color       | Connection            |
| Red and White    | Power Antenna Lead    |
| RCA Jacks        | High or Low Level     |
|                  | Signals               |

Table 2: Bridged Mono Configuration
| Switch Position          | Configuration                            |
| Stereo                   | Normal Stereo Amplifier                   |
| Bridged                  | Bridged Mono Amplifier                    |

Table 3: Two-Ohm Capability Configuration
| Jumper Configuration     | Load Capability                          |
| 4-ohm lugs               | 2-ohm Stereo Load or 4-ohm Mono Load      |
| 2-ohm lugs               | 2-ohm Stereo Load or 4-ohm Mono Load      |

Table 4: Speaker Output Connections
| Terminal               | Speaker Connection                         |
| LEFT (-)               | Left Channel Negative                       |
| LEFT (+)               | Left Channel Positive                       |
| RIGHT (+)              | Right Channel Positive                      |
| RIGHT (-)              | Right Channel Negative                      |
Title: Operation/Adjustment and General Troubleshooting of Linear Power Amplifier

Summary: This document provides instructions for adjusting the sensitivity of the amplifier and deck to minimize distortion and achieve maximum power output. It also offers troubleshooting tips for common issues such as no sound, blown fuses, and unexpected shut off.

Table 1: Operation/Adjustment
amp sensitivity | deck output | sensitivity control | amp output
minimum         | cleanest    | minimum             | cleanest
slightly higher | slightly    | slightly higher     | slightly
                | distorted   |                     | distorted
maximum         | usable      | maximum             | maximum
                | output      | usable output       | output

Table 2: General Troubleshooting
Issue        | Checks and Solutions
No sound     | - Check connections
             | - Check main power fuses
             | - Check accessory fuse
             | - Verify presence of +12v at amplifier
             | - Ensure a good ground connection
             | - Check music source for proper operation
Blows fuses  | - Check power and speaker wire connections
             | - Verify polarity of main power wires
             | - Check speaker impedance and power tap settings
Shuts off    | - Check for high ambient temperature or improper speaker impedance
             | - Turn down volume while waiting for the amp to turn back on
             | - Use a fan to cool the amplifier if issue persists
             | - Verify proper speaker loads and connections

very cool and it was one line of code just like it likely will be in his “R” implementation! Noice!

 page_text = page.extract_text()

This is really impressive, thank you!
One question: how can I then send the table to a data frame to be exported as a CSV?

This is my current output:

..AI processing page 0
Table Title: Non-Executive Director Remuneration
Summary: This table lists non-executive directors' compensation for 2022 and 2023 of Lindsay Australia

Director,Year,Salary and fees ($), Cash Bonus ($), Non-Monetary benefits ($), Long service leave ($), Superannuation ($), Options ($), Total ($), Performance related (%)
I M Williams (Chair),2023,110406,0,0,0,11490,0,121896,NA
I M Williams (Chair),2022,70471,0,0,0,7079,0,77550,NA
R L Green,2023,85317,0,0,0,8881,0,94198,NA
R L Green,2022,63278,0,0,0,6331,0,69609,NA
M R Stubbs,2023,85317,0,0,0,8881,0,94198,NA
M R Stubbs,2022,52853,0,0,0,5285,0,58138,NA
S P Cantwell,2023,87799,0,0,0,6865,0,94664,NA
S P Cantwell,2022,34912,0,0,0,3491,0,38403,NA
A R Kelly (resigned 5 November 2021),2023,0,0,0,0,0,0,0,NA
A R Kelly (resigned 5 November 2021),2022,22305,0,0,0,2233,0,24538,NA
R A Anderson (resigned 31 August 2021),2023,0,0,0,0,0,0,0,NA
R A Anderson (resigned 31 August 2021),2022,14224,0,0,0,1427,0,15651,NA
Sub-Total,2022,258043,0,0,0,25846,0,283889,NA
AI-processed text saved to lau.pdf-AI-all.txt

This is the output I want, in Excel/CSV/database form:

And this was my original pdf:

I am working with hundreds of PDFs where the format of the table is not exactly the same. That is why I can’t easily write code by hand for each one.

ChatGPT thinks you are talking about a pandas dataframe, and you also can ask it “An AI makes a table in markdown, and there are multiple tables per text file. How can I then send the table to a df to be exported as a csv”
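For the markdown case described here (several pipe-delimited tables in one text file), a minimal sketch using only the Python standard library might look like the following. The helper names parse_pipe_tables and to_csv_text are made up for this example, and treating a row with an empty first cell as a wrapped continuation of the previous row is an assumption based on the sample output earlier in the thread.

```python
import csv
import io


def parse_pipe_tables(text):
    """Group consecutive '|' lines into tables; each table is a list of rows."""
    tables, current = [], []
    for line in text.splitlines():
        s = line.strip()
        if "|" not in s:
            # A line without pipes ends the table in progress
            if current:
                tables.append(current)
                current = []
            continue
        if len(s) > 1 and s.startswith("|") and s.endswith("|"):
            s = s[1:-1]  # drop the outer pipes only when both are present
        cells = [c.strip() for c in s.split("|")]
        if all(c and set(c) <= set("-:") for c in cells):
            continue  # markdown separator row such as |---|---|
        if cells[0] == "" and current:
            # Assumption: an empty first cell marks a wrapped continuation row
            prev = current[-1]
            for i, c in enumerate(cells):
                if c and i < len(prev):
                    prev[i] = (prev[i] + " " + c).strip()
            continue
        current.append(cells)
    if current:
        tables.append(current)
    return tables


def to_csv_text(rows):
    """Serialize one table's rows as CSV text."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()
```

Each table's CSV text can then be written to its own file. If pandas is installed, pandas.read_csv(io.StringIO(csv_text)) would load it as a DataFrame.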

Are you able to help me with this prompt or code? Thanks

It sounds like you are going beyond the forum’s mission of empowering “how do I bamboozle an AI into doing my job” into the territory of “how do I bamboozle humans into doing my job”. I’d try the former approach first.

You just say to GPT: “Extract the information in this text into a CSV file” as a line of your prompt above the content, and it will extract to a CSV if it can. Which part of that are you confused about?
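Once the reply is comma-separated, getting it into rows or a .csv file is ordinary parsing. Below is a minimal sketch using only the Python standard library. It assumes the reply keeps the "Title:" / "Summary:" preface lines seen in the console output earlier in the thread. The names parse_ai_csv and write_table are hypothetical.

```python
import csv


def parse_ai_csv(ai_text):
    """Split AI output into tables: a list of {'title': str, 'rows': [[...], ...]}."""
    tables, current = [], None
    for line in ai_text.splitlines():
        stripped = line.strip()
        lowered = stripped.lower()
        if lowered.startswith(("title:", "table title:")):
            # A title line starts a new table
            current = {"title": stripped.split(":", 1)[1].strip(), "rows": []}
            tables.append(current)
        elif not stripped or lowered.startswith(("summary:", "..ai processing")):
            continue  # blank lines, summaries and console progress lines are skipped
        else:
            if current is None:
                current = {"title": "", "rows": []}
                tables.append(current)
            # csv.reader copes with quoted fields; skipinitialspace drops the
            # blank that the model sometimes leaves after a comma
            row = next(csv.reader([stripped], skipinitialspace=True))
            current["rows"].append(row)
    return tables


def write_table(table, path):
    """Write one parsed table to a CSV file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        csv.writer(f).writerows(table["rows"])
```

Since table layouts differ between PDFs, writing one file per table avoids forcing mismatched columns together. If pandas is available, pandas.DataFrame(rows[1:], columns=rows[0]) would turn a table's rows into a DataFrame.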