Import differing text files and build a database - Python

rl16 · September 11, 2023, 6:24am

I have text files that contain tables that I have extracted from pdfs.
Some text files contain 1 table and some contain 2 or more. They are all in CSV format.

I want to import them into python, save them as a dataframe and build a database. Each textfile is for a different company (fake data).

My End goal is something that looks like this:

Company	Executive	Year	Cash Salary and Fees	Cash STI/Bonus	Non-monetary Benefits	Superannuation	Annual and Long Service Leave	Deferred STI	LTI Awards	Total
AA	John Doe	2023	XXXX	XXXX	XXXX	XXXX	XXXX	XXXX	XXXX	XXXX
AA	John Doe	2022	XXXX	XXXX	XXXX	XXXX	XXXX	XXXX	XXXX	XXXX
AA	Greg Doe	2023	XXXX	XXXX	XXXX	XXXX	XXXX	XXXX	XXXX	XXXX
AA	Greg Doe	2022	XXXX	XXXX	XXXX	XXXX	XXXX	XXXX	XXXX	XXXX
BB	Steve Doe	2023	XXXX	XXXX	XXXX	XXXX	XXXX	XXXX	XXXX	XXXX
BB	Steve Doe	2022	XXXX	XXXX	XXXX	XXXX	XXXX	XXXX	XXXX	XXXX
BB	Alice Doe	2023	XXXX	XXXX	XXXX	XXXX	XXXX	XXXX	XXXX	XXXX
BB	Alice Doe	2022	XXXX	XXXX	XXXX	XXXX	XXXX	XXXX	XXXX	XXXX

A problem that I will run into is that the headings for the tables are not always the same. So when building the database if the headings are the same I am obviously happy to merge the data. Otherwise I am happy to create a new column ( I dont want to lose data)

Another thing is you can see below sometimes the text file has the table name and sometimes it doesnt, so importing this int python I have to handle this. Coding this seems tricky, so perhaps utilising the chatgpt 4 engine is the best thing to do.

Any help would be greatly appreciated thanks.

2 examples of my text files are below

File1:

Table 1:
Executive Name, Year, Fees/Salary, Cash Bonus, Non-monetary Benefit, Superannuation, Cash settled share based payments, Equity settled share-based payments, Total, Percentage of remuneration related to performance, Cash Bonus as a percentage of maximum achievable
Timothy Burnett, 2023, 110481, -, -, 11601, -, -, 122082, -, -
Greg Fletcher, 2023, 72118, -, -, 7573, -, -, 79691, -, -
Nick Yates, 2023, 72118, -, -, 7573, -, -, 79691, -, -
Mark Benson, 2023, 548239, 353243, 14427, 27132, -, 107601, 1050642, 43.9%, 82.0%
Brett Gregory, 2023, 184384, 67242, -, 13256, 48516, -, 313398, 36.9%, 82.0%
Rudy Sheriff, 2023, 289245, 80000, 3303, 16862, -, -, 389410, 20.5%, 98.0%

Table 2:
Executive Name, Year, Fees/Salary, Cash Bonus, Non-monetary Benefit, Superannuation, Cash settled share based payments, Equity settled share-based payments, Total, Percentage of remuneration related to performance, Cash Bonus as a percentage of maximum achievable
Timothy Burnett, 2022, 104545, -, -, 10455, -, -, 115000, -, -
Greg Fletcher, 2022, 68182, -, -, 6818, -, -, 75000, -, -
Nick Yates, 2022, 68182, -, -, 6818, -, -, 75000, -, -
Mark Benson, 2022, 516813, 280404, 13185, 23568, -, 111300, 945270, 41.4%, 86.0%
Rudy Sheriff, 2022, 326637, 97141, 6192, 23568, -, 34932, 488470, 27.0%, 90.0%

File2

Name,Year,Salary and Fees,Cash Bonus,Non-monetary Benefits,Long Service Leave,Superannuation,Options,Total,Performance Related
I M Williams (Chair),2023,110406,-,-,-,11490,-,121896,NA
I M Williams (Chair),2022,70471,-,-,-,7079,-,77550,NA
R L Green,2023,85317,-,-,-,8881,-,94198,NA
R L Green,2022,63278,-,-,-,6331,-,69609,NA
M R Stubbs,2023,85317,-,-,-,8881,-,94198,NA
M R Stubbs,2022,52853,-,-,-,5285,-,58138,NA
S P Cantwell,2023,87799,-,-,-,6865,-,94664,NA
S P Cantwell,2022,34912,-,-,-,3491,-,38403,NA
A R Kelly (resigned 5 November 2021),2023,-,-,-,-,-,-,-,NA
A R Kelly (resigned 5 November 2021),2022,22305,-,-,-,2233,-,24538,NA
R A Anderson (resigned 31 August 2021),2023,-,-,-,-,-,-,-,NA
R A Anderson (resigned 31 August 2021),2022,14224,-,-,-,1427,-,15651,NA
Sub-Total,2023,368839,-,-,-,36117,-,404956,NA
Sub-Total,2022,258043,-,-,-,25846,-,283889,NA

Foxalabs · September 11, 2023, 8:40am

Hi,

If you copy/paste that post exactly as is into GPT-4, it will likely give you a great start on your problem, at the moment you don’t seem to have any code at all, so that would be a first start, chat with GPT-4 about what you want and see how that works for you.

rl16 · September 11, 2023, 9:17am

Hi, thanks for the suggestion. I have tried this many times and can’t find any good and working solution. This is why I went to the forums

Also chatgpt isn’t trained up to the release of the chatgpt4 api, so the solutions provided are not optimal for implementation of the api

Craigolodon · September 11, 2023, 6:08pm

Howdy,

Great question. Let’s break the different tasks into parts, since some seem like better fits for AI than other ones.

1. Determining whether there are multiple tables in a file and where they are

This problem is probably better handled via other methods. For instance, if tables are consistently separated by an empty row, or they have a clear header labeling them, then I would suggest simply writing some sort of loop or other process to detect that flag and split the table at that point. This is probably not a good problem to use AI for because it would involve sending the entire file to the prompt, and that would be very expensive token-wise. Also, performance would be pretty unpredictable in my experience.

Taking your screenshot example something like this would probably work (shamelessly taken from GPT-4):

import pandas as pd

# Reading the entire file
with open("File.txt", "r") as file: # your file path goes here
    all_lines = file.readlines()

# Variables to store the start and end of each table
tables = []
start_line = None

# Iterating through the file lines to locate the tables
for index, line in enumerate(all_lines):
    if "Table " in line:  # A new table starts
        if start_line is not None:  # This means another table ended just before this line
            tables.append((start_line, index-1))
        start_line = index
    elif not line.strip() and start_line is not None:  # An empty line after a table's start has been detected
        tables.append((start_line, index-1))
        start_line = None

# If the file ends without an empty line after the last table
if start_line is not None:
    tables.append((start_line, len(all_lines)-1))

# Extracting tables into dataframes
dataframes = []
for start, end in tables:
    # Getting the table data
    table_data = all_lines[start+2:end+1]
    # Creating a dataframe
    df = pd.DataFrame([line.strip().split(", ") for line in table_data], 
                      columns=all_lines[start+1].strip().split(", "))
    dataframes.append(df)

# Returning the first two dataframes for verification
dataframes[:2]

2. Labeling the Tables
I’m dubious as to the value proposition here, but you could craft an API call which passes the field names from the file to the API and asks for it to return a descriptive name of the data set. Personally, I would prefer some sort of name based on the filename the table is taken from and the table index if there are multiple tables in the file, as that provides a way to track the table back to its source.

3. Resolving Header Differences
This is a great and underexplored use case for the API, IMO. I actually have several programs I use at my office that do something like this. However, in each case, I’m resolving a sort of one-to-one merge, which may or may not be what your use case requires. In any case, basically, you just pass a prompt like the following to the API.

def semantic_match(list_a, list_b):
    """
    This function finds semantic matches for values in list a in list b.
    It uses GPT-4 to create chat prompts for the matching process.

    :param list_a: list of values to be matched
    :param list_b: list to be searched for matches
    :return: a dictionary with matched values

    Raises ValueError: If the provided parameters are not lists.
    """

    # Check if parameters are lists
    if not isinstance(list_a, list) or not isinstance(list_b, list):
        raise ValueError('Both parameters must be of list type.')

    # Copy and sort the input lists
    list_a_copy = list_a.copy()
    list_b_copy = list_b.copy()
    list_a_copy.sort()
    list_b_copy.sort()

    # Define the system message
    system_message = ("INSTRUCTION: Match the values from list_a to the values in list_b. "
                      "Return the matched values as a JSON object.\n"
                      "EXAMPLE:\n"
                      "list_a : ['one', 'three', 'six', 'seven']\n"
                      "list_b : [1, 2, 3, 4, 5, 10]\n"
                      "response : {'one':1, 'three':3}\n")

    # Define the user message
    user_prompt = ("PROMPT:\n"
                   f"list_a : {list_a_copy}\n"
                   f"list_b : {list_b_copy}\n"
                   "response : ")

    # Create the chat completion
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": system_message},
            {"role": "user", "content": user_prompt},
        ]
    )

    model_message = response['choices'][0]['message']['content'].replace("'","\"")

    # Parse and return the response as a dictionary
    try:
        matched_dict = json.loads(model_message)
    except json.JSONDecodeError:
        raise ValueError(f'The model did not return a valid JSON response.\nModel Message : {model_message}')

    return matched_dict

This gives you a dict of renames to apply to your list. You will want to print out the results and really audit it, but I’ve found it works very well for my use cases. You also might want to consider passing it a None option as well.

This general concept is called Semantic Matching, meaning we are matching two objects based on them sharing the same meaning, rather than form. It is a task that constantly recurs in workspaces around the world, and I think AI is really promising for solving it.

rl16 · September 12, 2023, 5:46am

Thanks very much I will look into this

jochenschultz · September 14, 2023, 12:41pm

Let it create a json - give the expected json as an example.
Tell it to only give you json and nothing else is allowed no explanation or anything else.

Or ask it to give you a table structure and a set of inserts for that table.

Topic		Replies	Views
Read into pdf and output table API gpt-4 , chatgpt , api	10	9126	September 9, 2023
How to feed Table data into OpenAPI GPT model for creating my own customised chatGPT for the workplace? API	18	11557	June 3, 2024
Search long pdf for specific table - possibly need fine tuning model API gpt-4 , fine-tuning , api	10	2700	March 29, 2024
Using large PDFs to make a ChatBot API chatgpt , api	21	5911	December 15, 2023
Factual Search on CSV, Excel and Output as Table API api	4	1144	December 19, 2023

Import differing text files and build a database - Python

Related topics