Import differing text files and build a database - Python

I have text files that contain tables that I have extracted from PDFs.
Some text files contain one table and some contain two or more. They are all in CSV format.

I want to import them into Python, save them as dataframes, and build a database. Each text file is for a different company (fake data).

My end goal is something that looks like this:

Company | Executive | Year | Cash Salary and Fees | Cash STI/Bonus | Non-monetary Benefits | Superannuation | Annual and Long Service Leave | Deferred STI | LTI Awards | Total
AA | John Doe | 2023 | XXXX | XXXX | XXXX | XXXX | XXXX | XXXX | XXXX | XXXX
AA | John Doe | 2022 | XXXX | XXXX | XXXX | XXXX | XXXX | XXXX | XXXX | XXXX
AA | Greg Doe | 2023 | XXXX | XXXX | XXXX | XXXX | XXXX | XXXX | XXXX | XXXX
AA | Greg Doe | 2022 | XXXX | XXXX | XXXX | XXXX | XXXX | XXXX | XXXX | XXXX
BB | Steve Doe | 2023 | XXXX | XXXX | XXXX | XXXX | XXXX | XXXX | XXXX | XXXX
BB | Steve Doe | 2022 | XXXX | XXXX | XXXX | XXXX | XXXX | XXXX | XXXX | XXXX
BB | Alice Doe | 2023 | XXXX | XXXX | XXXX | XXXX | XXXX | XXXX | XXXX | XXXX
BB | Alice Doe | 2022 | XXXX | XXXX | XXXX | XXXX | XXXX | XXXX | XXXX | XXXX

A problem that I will run into is that the headings for the tables are not always the same. So when building the database, if the headings are the same I am obviously happy to merge the data; otherwise I am happy to create a new column (I don't want to lose data).

Another thing, as you can see below, is that sometimes the text file has the table name and sometimes it doesn't, so I have to handle this when importing into Python. Coding this seems tricky, so perhaps utilising the GPT-4 engine is the best thing to do.

Any help would be greatly appreciated thanks.

Two examples of my text files are below.

File1:

Table 1:
Executive Name, Year, Fees/Salary, Cash Bonus, Non-monetary Benefit, Superannuation, Cash settled share based payments, Equity settled share-based payments, Total, Percentage of remuneration related to performance, Cash Bonus as a percentage of maximum achievable
Timothy Burnett, 2023, 110481, -, -, 11601, -, -, 122082, -, -
Greg Fletcher, 2023, 72118, -, -, 7573, -, -, 79691, -, -
Nick Yates, 2023, 72118, -, -, 7573, -, -, 79691, -, -
Mark Benson, 2023, 548239, 353243, 14427, 27132, -, 107601, 1050642, 43.9%, 82.0%
Brett Gregory, 2023, 184384, 67242, -, 13256, 48516, -, 313398, 36.9%, 82.0%
Rudy Sheriff, 2023, 289245, 80000, 3303, 16862, -, -, 389410, 20.5%, 98.0%

Table 2:
Executive Name, Year, Fees/Salary, Cash Bonus, Non-monetary Benefit, Superannuation, Cash settled share based payments, Equity settled share-based payments, Total, Percentage of remuneration related to performance, Cash Bonus as a percentage of maximum achievable
Timothy Burnett, 2022, 104545, -, -, 10455, -, -, 115000, -, -
Greg Fletcher, 2022, 68182, -, -, 6818, -, -, 75000, -, -
Nick Yates, 2022, 68182, -, -, 6818, -, -, 75000, -, -
Mark Benson, 2022, 516813, 280404, 13185, 23568, -, 111300, 945270, 41.4%, 86.0%
Rudy Sheriff, 2022, 326637, 97141, 6192, 23568, -, 34932, 488470, 27.0%, 90.0%

File2:

Name,Year,Salary and Fees,Cash Bonus,Non-monetary Benefits,Long Service Leave,Superannuation,Options,Total,Performance Related
I M Williams (Chair),2023,110406,-,-,-,11490,-,121896,NA
I M Williams (Chair),2022,70471,-,-,-,7079,-,77550,NA
R L Green,2023,85317,-,-,-,8881,-,94198,NA
R L Green,2022,63278,-,-,-,6331,-,69609,NA
M R Stubbs,2023,85317,-,-,-,8881,-,94198,NA
M R Stubbs,2022,52853,-,-,-,5285,-,58138,NA
S P Cantwell,2023,87799,-,-,-,6865,-,94664,NA
S P Cantwell,2022,34912,-,-,-,3491,-,38403,NA
A R Kelly (resigned 5 November 2021),2023,-,-,-,-,-,-,-,NA
A R Kelly (resigned 5 November 2021),2022,22305,-,-,-,2233,-,24538,NA
R A Anderson (resigned 31 August 2021),2023,-,-,-,-,-,-,-,NA
R A Anderson (resigned 31 August 2021),2022,14224,-,-,-,1427,-,15651,NA
Sub-Total,2023,368839,-,-,-,36117,-,404956,NA
Sub-Total,2022,258043,-,-,-,25846,-,283889,NA

Hi,

If you copy/paste that post exactly as-is into GPT-4, it will likely give you a great start on your problem. At the moment you don’t seem to have any code at all, so that would be a first step: chat with GPT-4 about what you want and see how that works for you.

Hi, thanks for the suggestion. I have tried this many times and can’t find a good, working solution. That is why I came to the forums.

Also, ChatGPT’s training data doesn’t extend to the release of the GPT-4 API, so the solutions it provides are not optimal for implementing against the API.

Howdy,

Great question. Let’s break the different tasks into parts, since some seem like better fits for AI than others.

1. Determining whether there are multiple tables in a file and where they are

This problem is probably better handled via other methods. For instance, if tables are consistently separated by an empty row, or they have a clear header labeling them, then I would suggest simply writing some sort of loop or other process to detect that flag and split the table at that point. This is probably not a good problem to use AI for because it would involve sending the entire file to the prompt, and that would be very expensive token-wise. Also, performance would be pretty unpredictable in my experience.

Taking your example files, something like this would probably work (shamelessly taken from GPT-4):

import pandas as pd

# Reading the entire file
with open("File.txt", "r") as file:  # your file path goes here
    all_lines = file.readlines()

# Variables to store the start and end (inclusive) of each table
tables = []
start_line = None

# Iterating through the file lines to locate the tables
for index, line in enumerate(all_lines):
    if not line.strip():  # An empty line ends the table currently being read
        if start_line is not None:
            tables.append((start_line, index - 1))
            start_line = None
    elif start_line is None:  # First non-empty line after a gap starts a new table
        start_line = index

# If the file ends without an empty line after the last table
if start_line is not None:
    tables.append((start_line, len(all_lines) - 1))

# Extracting tables into dataframes
dataframes = []
for start, end in tables:
    # Skip the optional "Table N:" label line if it is present
    header_index = start + 1 if all_lines[start].strip().startswith("Table") else start
    # Header and data rows are comma-separated; strip whitespace around each cell
    # so both "a, b" and "a,b" style files parse the same way
    columns = [col.strip() for col in all_lines[header_index].strip().split(",")]
    table_data = all_lines[header_index + 1:end + 1]
    rows = [[cell.strip() for cell in row.strip().split(",")] for row in table_data]
    # Creating a dataframe for this table
    df = pd.DataFrame(rows, columns=columns)
    dataframes.append(df)

# Inspecting the first two dataframes for verification
dataframes[:2]
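
To get from there to your end goal, you could wrap the logic above in a function and run it over every file, tagging each row with the company. A rough sketch, assuming the company can be read from the file name (adjust if it lives somewhere else) and assuming extract_tables is the splitting logic above wrapped into a function that returns a list of dataframes:

from pathlib import Path

import pandas as pd

all_frames = []
for path in Path("text_files").glob("*.txt"):   # hypothetical folder of extracted files
    for df in extract_tables(path):             # extract_tables = the splitting logic above as a function
        df["Company"] = path.stem               # assumption: the file name identifies the company
        all_frames.append(df)

# concat aligns on column names; columns that only appear in some files are
# kept and filled with NaN elsewhere, so no data is lost
combined = pd.concat(all_frames, ignore_index=True, sort=False)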

2. Labeling the Tables
I’m dubious as to the value proposition here, but you could craft an API call which passes the field names from the file to the API and asks for it to return a descriptive name of the data set. Personally, I would prefer some sort of name based on the filename the table is taken from and the table index if there are multiple tables in the file, as that provides a way to track the table back to its source.
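
If you do go the filename route, it is only a couple of lines. A minimal sketch, assuming table_index is the position of the table within the file:

from pathlib import Path

def table_label(file_path, table_index):
    # e.g. "File1_table_2" for the second table found in File1.txt
    return f"{Path(file_path).stem}_table_{table_index + 1}"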

3. Resolving Header Differences
This is a great and underexplored use case for the API, IMO. I actually have several programs at my office that do something like this. However, in each case I’m resolving a sort of one-to-one merge, which may or may not be what your use case requires. Basically, you just pass a prompt like the following to the API.

import json

import openai


def semantic_match(list_a, list_b):
    """
    This function finds semantic matches for values in list a in list b.
    It uses GPT-4 to create chat prompts for the matching process.

    :param list_a: list of values to be matched
    :param list_b: list to be searched for matches
    :return: a dictionary with matched values

    Raises ValueError: If the provided parameters are not lists.
    """

    # Check if parameters are lists
    if not isinstance(list_a, list) or not isinstance(list_b, list):
        raise ValueError('Both parameters must be of list type.')

    # Copy and sort the input lists
    list_a_copy = list_a.copy()
    list_b_copy = list_b.copy()
    list_a_copy.sort()
    list_b_copy.sort()

    # Define the system message
    system_message = ("INSTRUCTION: Match the values from list_a to the values in list_b. "
                      "Return the matched values as a JSON object.\n"
                      "EXAMPLE:\n"
                      "list_a : ['one', 'three', 'six', 'seven']\n"
                      "list_b : [1, 2, 3, 4, 5, 10]\n"
                      "response : {'one':1, 'three':3}\n")

    # Define the user message
    user_prompt = ("PROMPT:\n"
                   f"list_a : {list_a_copy}\n"
                   f"list_b : {list_b_copy}\n"
                   "response : ")

    # Create the chat completion
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": system_message},
            {"role": "user", "content": user_prompt},
        ]
    )

    model_message = response['choices'][0]['message']['content'].replace("'","\"")

    # Parse and return the response as a dictionary
    try:
        matched_dict = json.loads(model_message)
    except json.JSONDecodeError:
        raise ValueError(f'The model did not return a valid JSON response.\nModel Message : {model_message}')

    return matched_dict

This gives you a dict of renames to apply to your columns. You will want to print out the results and audit them carefully, but I’ve found it works very well for my use cases. You might also want to consider giving the model a None option for values that have no good match.
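
For illustration, applying the result to one of your dataframes could look something like this (canonical_columns here is a hypothetical master list of the column names you want the database to use):

canonical_columns = ["Executive", "Year", "Cash Salary and Fees", "Cash STI/Bonus",
                     "Non-monetary Benefits", "Superannuation", "Total"]

# map this file's headers onto the canonical names, then rename the columns
renames = semantic_match(list(df.columns), canonical_columns)
df = df.rename(columns=renames)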

This general concept is called Semantic Matching: we are matching two objects because they share the same meaning, rather than the same form. It is a task that constantly recurs in workplaces around the world, and I think AI is really promising for solving it.

Thanks very much, I will look into this.

Let it create JSON, and give the expected JSON as an example.
Tell it to only give you JSON and that nothing else is allowed: no explanation or anything else.

Or ask it to give you a table structure and a set of INSERT statements for that table.
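
For illustration, a stricter version of the earlier system message might look something like this (the exact wording is an assumption and should be tuned against your own files):

strict_system_message = (
    "INSTRUCTION: Match the values from list_a to the values in list_b.\n"
    "Return ONLY a JSON object mapping each value in list_a to its match in list_b, "
    "or to null if there is no match. No explanation, no code fences, nothing but JSON.\n"
    "EXAMPLE:\n"
    'list_a : ["Fees/Salary", "Cash Bonus"]\n'
    'list_b : ["Cash Salary and Fees", "Cash STI/Bonus", "Total"]\n'
    'response : {"Fees/Salary": "Cash Salary and Fees", "Cash Bonus": "Cash STI/Bonus"}'
)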