I have text files that contain tables that I have extracted from pdfs.
Some text files contain 1 table and some contain 2 or more. They are all in CSV format.
I want to import them into python, save them as a dataframe and build a database. Each textfile is for a different company (fake data).
My End goal is something that looks like this:
Company | Executive | Year | Cash Salary and Fees | Cash STI/Bonus | Non-monetary Benefits | Superannuation | Annual and Long Service Leave | Deferred STI | LTI Awards | Total |
---|---|---|---|---|---|---|---|---|---|---|
AA | John Doe | 2023 | XXXX | XXXX | XXXX | XXXX | XXXX | XXXX | XXXX | XXXX |
AA | John Doe | 2022 | XXXX | XXXX | XXXX | XXXX | XXXX | XXXX | XXXX | XXXX |
AA | Greg Doe | 2023 | XXXX | XXXX | XXXX | XXXX | XXXX | XXXX | XXXX | XXXX |
AA | Greg Doe | 2022 | XXXX | XXXX | XXXX | XXXX | XXXX | XXXX | XXXX | XXXX |
BB | Steve Doe | 2023 | XXXX | XXXX | XXXX | XXXX | XXXX | XXXX | XXXX | XXXX |
BB | Steve Doe | 2022 | XXXX | XXXX | XXXX | XXXX | XXXX | XXXX | XXXX | XXXX |
BB | Alice Doe | 2023 | XXXX | XXXX | XXXX | XXXX | XXXX | XXXX | XXXX | XXXX |
BB | Alice Doe | 2022 | XXXX | XXXX | XXXX | XXXX | XXXX | XXXX | XXXX | XXXX |
A problem that I will run into is that the headings for the tables are not always the same. So when building the database if the headings are the same I am obviously happy to merge the data. Otherwise I am happy to create a new column ( I dont want to lose data)
Another thing is you can see below sometimes the text file has the table name and sometimes it doesnt, so importing this int python I have to handle this. Coding this seems tricky, so perhaps utilising the chatgpt 4 engine is the best thing to do.
Any help would be greatly appreciated thanks.
2 examples of my text files are below
File1:
Table 1:
Executive Name, Year, Fees/Salary, Cash Bonus, Non-monetary Benefit, Superannuation, Cash settled share based payments, Equity settled share-based payments, Total, Percentage of remuneration related to performance, Cash Bonus as a percentage of maximum achievable
Timothy Burnett, 2023, 110481, -, -, 11601, -, -, 122082, -, -
Greg Fletcher, 2023, 72118, -, -, 7573, -, -, 79691, -, -
Nick Yates, 2023, 72118, -, -, 7573, -, -, 79691, -, -
Mark Benson, 2023, 548239, 353243, 14427, 27132, -, 107601, 1050642, 43.9%, 82.0%
Brett Gregory, 2023, 184384, 67242, -, 13256, 48516, -, 313398, 36.9%, 82.0%
Rudy Sheriff, 2023, 289245, 80000, 3303, 16862, -, -, 389410, 20.5%, 98.0%
Table 2:
Executive Name, Year, Fees/Salary, Cash Bonus, Non-monetary Benefit, Superannuation, Cash settled share based payments, Equity settled share-based payments, Total, Percentage of remuneration related to performance, Cash Bonus as a percentage of maximum achievable
Timothy Burnett, 2022, 104545, -, -, 10455, -, -, 115000, -, -
Greg Fletcher, 2022, 68182, -, -, 6818, -, -, 75000, -, -
Nick Yates, 2022, 68182, -, -, 6818, -, -, 75000, -, -
Mark Benson, 2022, 516813, 280404, 13185, 23568, -, 111300, 945270, 41.4%, 86.0%
Rudy Sheriff, 2022, 326637, 97141, 6192, 23568, -, 34932, 488470, 27.0%, 90.0%
File2
Name,Year,Salary and Fees,Cash Bonus,Non-monetary Benefits,Long Service Leave,Superannuation,Options,Total,Performance Related
I M Williams (Chair),2023,110406,-,-,-,11490,-,121896,NA
I M Williams (Chair),2022,70471,-,-,-,7079,-,77550,NA
R L Green,2023,85317,-,-,-,8881,-,94198,NA
R L Green,2022,63278,-,-,-,6331,-,69609,NA
M R Stubbs,2023,85317,-,-,-,8881,-,94198,NA
M R Stubbs,2022,52853,-,-,-,5285,-,58138,NA
S P Cantwell,2023,87799,-,-,-,6865,-,94664,NA
S P Cantwell,2022,34912,-,-,-,3491,-,38403,NA
A R Kelly (resigned 5 November 2021),2023,-,-,-,-,-,-,-,NA
A R Kelly (resigned 5 November 2021),2022,22305,-,-,-,2233,-,24538,NA
R A Anderson (resigned 31 August 2021),2023,-,-,-,-,-,-,-,NA
R A Anderson (resigned 31 August 2021),2022,14224,-,-,-,1427,-,15651,NA
Sub-Total,2023,368839,-,-,-,36117,-,404956,NA
Sub-Total,2022,258043,-,-,-,25846,-,283889,NA