Financial analysis, Big Data and PDF Table extraction - Best Practice

malte.huener · January 18, 2024, 5:57am

Hi there…

I am working on a financial anylysis of German so called BWA (business reports made by tax consultants with a tax Software called DATEV) . They are served as table data inside PDFs and have an average Page number of 40 pages.
Here is one example:

I already extracted the table data:, modified the table header (add the dates to the following columns for better understanding for the LLM, cleaned the data frame (deleted NONE Values and empty rows (for smaller in-Token), here is the result) and converted the data into JSON:

[{'Bezeichnung': 'Umsatzerlöse',
  'Mrz/2016': '412575.15',
  'Mrz/2016 % Ges.-Leistg.': '99.95',
  'Jan/2016-Mrz/2016': '1100164.63',
  'Jan/2016-Mrz/2016 % Ges.-Leistg.': '100.04'},
 {'Bezeichnung': 'Best.Verdg. FE/UE',
  'Mrz/2016': '222.25',
  'Mrz/2016 % Ges.-Leistg.': '0.05',
  'Jan/2016-Mrz/2016': '-450,97',
  'Jan/2016-Mrz/2016 % Ges.-Leistg.': '-0,04'},
 {'Bezeichnung': 'Akt.Eigenleistungen',
  'Mrz/2016': '0.00',
  'Jan/2016-Mrz/2016': '0.00'},
 {'Bezeichnung': 'Gesamtleistung',
  'Mrz/2016': '412797.40',
  'Mrz/2016 % Ges.-Leistg.': '100.00',
  'Mrz/2016 % Ges.-Kosten': '255.26',
  'Mrz/2016 % Pers.-Kosten': '367.53',
  'Jan/2016-Mrz/2016': '1099713.66',
  'Jan/2016-Mrz/2016 % Ges.-Leistg.': '100.00',
  'Jan/2016-Mrz/2016 % Ges.-Kosten': '203.92',
  'Jan/2016-Mrz/2016 % Pers.-Kosten': '302.00'},
 {'Bezeichnung': 'Mat./Wareneinkauf',
  'Mrz/2016': '177825.59',
  'Mrz/2016 % Ges.-Leistg.': '43.08',
  'Mrz/2016 % Ges.-Kosten': '109.96',
  'Mrz/2016 % Pers.-Kosten': '158.33',
  'Mrz/2016 Auf-schlag': '100.00',
  'Jan/2016-Mrz/2016': '503196.62',
  'Jan/2016-Mrz/2016 % Ges.-Leistg.': '45.76',
  'Jan/2016-Mrz/2016 % Ges.-Kosten': '93.31',
  'Jan/2016-Mrz/2016 % Pers.-Kosten': '138.19',
  'Jan/2016-Mrz/2016 Auf-schlag': '100.00'},
 {'Bezeichnung': 'Rohertrag',
  'Mrz/2016': '234971.81',
  'Mrz/2016 % Ges.-Leistg.': '56.92',
  'Mrz/2016 % Ges.-Kosten': '145.30',
  'Mrz/2016 % Pers.-Kosten': '209.21',
  'Mrz/2016 Auf-schlag': '132.14',
  'Jan/2016-Mrz/2016': '596517.04',
  'Jan/2016-Mrz/2016 % Ges.-Leistg.': '54.24',
  'Jan/2016-Mrz/2016 % Ges.-Kosten': '110.61',
  'Jan/2016-Mrz/2016 % Pers.-Kosten': '163.81',
  'Jan/2016-Mrz/2016 Auf-schlag': '118.55'},
 {'Bezeichnung': 'So. betr. Erlöse',
  'Mrz/2016': '294.12',
  'Mrz/2016 % Ges.-Leistg.': '0.07',
  'Mrz/2016 % Ges.-Kosten': '0.18',
  'Mrz/2016 % Pers.-Kosten': '0.26',
  'Jan/2016-Mrz/2016': '882.36',
  'Jan/2016-Mrz/2016 % Ges.-Leistg.': '0.08',
  'Jan/2016-Mrz/2016 % Ges.-Kosten': '0.16',
  'Jan/2016-Mrz/2016 % Pers.-Kosten': '0.24'},
 {'Bezeichnung': 'Betriebl. Rohertrag',
  'Mrz/2016': '235265.93',
  'Mrz/2016 % Ges.-Leistg.': '56.99',
  'Mrz/2016 % Ges.-Kosten': '145.48',
  'Mrz/2016 % Pers.-Kosten': '209.47',
  'Mrz/2016 Auf-schlag': '132.30',
  'Jan/2016-Mrz/2016': '597399.40',
  'Jan/2016-Mrz/2016 % Ges.-Leistg.': '54.32',
  'Jan/2016-Mrz/2016 % Ges.-Kosten': '110.78',
  'Jan/2016-Mrz/2016 % Pers.-Kosten': '164.06',
  'Jan/2016-Mrz/2016 Auf-schlag': '118.72'},
 {'Bezeichnung': 'Kostenarten:'},
 {'Bezeichnung': 'Personalkosten',
  'Mrz/2016': '112316.53',
  'Mrz/2016 % Ges.-Leistg.': '27.21',
  'Mrz/2016 % Ges.-Kosten': '69.45',
  'Mrz/2016 % Pers.-Kosten': '100.00',
  'Jan/2016-Mrz/2016': '364141.58',
  'Jan/2016-Mrz/2016 % Ges.-Leistg.': '33.11',
  'Jan/2016-Mrz/2016 % Ges.-Kosten': '67.52',
  'Jan/2016-Mrz/2016 % Pers.-Kosten': '100.00'},
 {'Bezeichnung': 'Raumkosten',
  'Mrz/2016': '15414.89',
  'Mrz/2016 % Ges.-Leistg.': '3.73',
  'Mrz/2016 % Ges.-Kosten': '9.53',
  'Mrz/2016 % Pers.-Kosten': '13.72',
  'Jan/2016-Mrz/2016': '46826.52',
  'Jan/2016-Mrz/2016 % Ges.-Leistg.': '4.26',
  'Jan/2016-Mrz/2016 % Ges.-Kosten': '8.68',
  'Jan/2016-Mrz/2016 % Pers.-Kosten': '12.86'},
 {'Bezeichnung': 'Betriebl. Steuern',
  'Mrz/2016': '568.00',
  'Mrz/2016 % Ges.-Leistg.': '0.14',
  'Mrz/2016 % Ges.-Kosten': '0.35',
  'Mrz/2016 % Pers.-Kosten': '0.51',
  'Jan/2016-Mrz/2016': '2110.00',
  'Jan/2016-Mrz/2016 % Ges.-Leistg.': '0.19',
  'Jan/2016-Mrz/2016 % Ges.-Kosten': '0.39',
  'Jan/2016-Mrz/2016 % Pers.-Kosten': '0.58'},
 {'Bezeichnung': 'Versich./Beiträge',
  'Mrz/2016': '2371.52',
  'Mrz/2016 % Ges.-Leistg.': '0.57',
  'Mrz/2016 % Ges.-Kosten': '1.47',
  'Mrz/2016 % Pers.-Kosten': '2.11',
  'Jan/2016-Mrz/2016': '14418.78',
  'Jan/2016-Mrz/2016 % Ges.-Leistg.': '1.31',
  'Jan/2016-Mrz/2016 % Ges.-Kosten': '2.67',
  'Jan/2016-Mrz/2016 % Pers.-Kosten': '3.96'},
 {'Bezeichnung': 'Besondere Kosten',
  'Mrz/2016': '0.00',
  'Jan/2016-Mrz/2016': '0.00'},
 {'Bezeichnung': 'Kfz-Kosten (o. St.)',
  'Mrz/2016': '7168.00',
  'Mrz/2016 % Ges.-Leistg.': '1.74',
  'Mrz/2016 % Ges.-Kosten': '4.43',
  'Mrz/2016 % Pers.-Kosten': '6.38',
  'Jan/2016-Mrz/2016': '32046.92',
  'Jan/2016-Mrz/2016 % Ges.-Leistg.': '2.91',
  'Jan/2016-Mrz/2016 % Ges.-Kosten': '5.94',
  'Jan/2016-Mrz/2016 % Pers.-Kosten': '8.80'},
 {'Bezeichnung': 'Werbe-/Reisekosten',
  'Mrz/2016': '4497.19',
  'Mrz/2016 % Ges.-Leistg.': '1.09',
  'Mrz/2016 % Ges.-Kosten': '2.78',
  'Mrz/2016 % Pers.-Kosten': '4.00',
  'Jan/2016-Mrz/2016': '11915.39',
  'Jan/2016-Mrz/2016 % Ges.-Leistg.': '1.08',
  'Jan/2016-Mrz/2016 % Ges.-Kosten': '2.21',
  'Jan/2016-Mrz/2016 % Pers.-Kosten': '3.27'},
 {'Bezeichnung': 'Kosten Warenabgabe',
  'Mrz/2016': '1422.70',
  'Mrz/2016 % Ges.-Leistg.': '0.34',
  'Mrz/2016 % Ges.-Kosten': '0.88',
  'Mrz/2016 % Pers.-Kosten': '1.27',
  'Jan/2016-Mrz/2016': '4562.58',
  'Jan/2016-Mrz/2016 % Ges.-Leistg.': '0.41',
  'Jan/2016-Mrz/2016 % Ges.-Kosten': '0.85',
  'Jan/2016-Mrz/2016 % Pers.-Kosten': '1.25'},
 {'Bezeichnung': 'Abschreibungen',
  'Mrz/2016': '6124.62',
  'Mrz/2016 % Ges.-Leistg.': '1.48',
  'Mrz/2016 % Ges.-Kosten': '3.79',
  'Mrz/2016 % Pers.-Kosten': '5.45',
  'Jan/2016-Mrz/2016': '18998.51',
  'Jan/2016-Mrz/2016 % Ges.-Leistg.': '1.73',
  'Jan/2016-Mrz/2016 % Ges.-Kosten': '3.52',
  'Jan/2016-Mrz/2016 % Pers.-Kosten': '5.22'},
 {'Bezeichnung': 'Reparatur/Instandh.',
  'Mrz/2016': '0.00',
  'Jan/2016-Mrz/2016': '7647.13',
  'Jan/2016-Mrz/2016 % Ges.-Leistg.': '0.70',
  'Jan/2016-Mrz/2016 % Ges.-Kosten': '1.42',
  'Jan/2016-Mrz/2016 % Pers.-Kosten': '2.10'},
 {'Bezeichnung': 'Sonstige Kosten',
  'Mrz/2016': '11835.35',
  'Mrz/2016 % Ges.-Leistg.': '2.87',
  'Mrz/2016 % Ges.-Kosten': '7.32',
  'Mrz/2016 % Pers.-Kosten': '10.54',
  'Jan/2016-Mrz/2016': '36612.01',
  'Jan/2016-Mrz/2016 % Ges.-Leistg.': '3.33',
  'Jan/2016-Mrz/2016 % Ges.-Kosten': '6.79',
  'Jan/2016-Mrz/2016 % Pers.-Kosten': '10.05'},
 {'Bezeichnung': 'Gesamtkosten',
  'Mrz/2016': '161718.80',
  'Mrz/2016 % Ges.-Leistg.': '39.18',
  'Mrz/2016 % Ges.-Kosten': '100.00',
  'Mrz/2016 % Pers.-Kosten': '143.98',
  'Jan/2016-Mrz/2016': '539279.42',
  'Jan/2016-Mrz/2016 % Ges.-Leistg.': '49.04',
  'Jan/2016-Mrz/2016 % Ges.-Kosten': '100.00',
  'Jan/2016-Mrz/2016 % Pers.-Kosten': '148.10'},
 {'Bezeichnung': 'Betriebsergebnis',
  'Mrz/2016': '73547.13',
  'Mrz/2016 % Ges.-Leistg.': '17.82',
  'Jan/2016-Mrz/2016': '58119.98',
  'Jan/2016-Mrz/2016 % Ges.-Leistg.': '5.29'},
 {'Bezeichnung': 'Zinsaufwand',
  'Mrz/2016': '7962.48',
  'Mrz/2016 % Ges.-Leistg.': '1.93',
  'Jan/2016-Mrz/2016': '11853.12',
  'Jan/2016-Mrz/2016 % Ges.-Leistg.': '1.08'},
 {'Bezeichnung': 'Sonst. neutr. Aufw',
  'Mrz/2016': '0.00',
  'Jan/2016-Mrz/2016': '400.00',
  'Jan/2016-Mrz/2016 % Ges.-Leistg.': '0.04'},
 {'Bezeichnung': 'Neutraler Aufwand',
  'Mrz/2016': '7962.48',
  'Mrz/2016 % Ges.-Leistg.': '1.93',
  'Jan/2016-Mrz/2016': '12253.12',
  'Jan/2016-Mrz/2016 % Ges.-Leistg.': '1.11'},
 {'Bezeichnung': 'Zinserträge',
  'Mrz/2016': '61.51',
  'Mrz/2016 % Ges.-Leistg.': '0.01',
  'Jan/2016-Mrz/2016': '61.51',
  'Jan/2016-Mrz/2016 % Ges.-Leistg.': '0.01'},
 {'Bezeichnung': 'Sonst. neutr. Ertr',
  'Mrz/2016': '0.00',
  'Jan/2016-Mrz/2016': '0.00'},
 {'Bezeichnung': 'Verr. kalk. Kosten',
  'Mrz/2016': '0.00',
  'Jan/2016-Mrz/2016': '0.00'},
 {'Bezeichnung': 'Neutraler Ertrag',
  'Mrz/2016': '61.51',
  'Mrz/2016 % Ges.-Leistg.': '0.01',
  'Jan/2016-Mrz/2016': '61.51',
  'Jan/2016-Mrz/2016 % Ges.-Leistg.': '0.01'},
 {'Bezeichnung': 'Kontenkl. unbesetzt',
  'Mrz/2016': '0.00',
  'Jan/2016-Mrz/2016': '0.00'},
 {'Bezeichnung': 'Ergebnis vor Steuern',
  'Mrz/2016': '65646.16',
  'Mrz/2016 % Ges.-Leistg.': '15.90',
  'Jan/2016-Mrz/2016': '45928.37',
  'Jan/2016-Mrz/2016 % Ges.-Leistg.': '4.18'},
 {'Bezeichnung': 'Steuern Eink.u.Ertr',
  'Mrz/2016': '13550.00',
  'Mrz/2016 % Ges.-Leistg.': '3.28',
  'Jan/2016-Mrz/2016': '26910.00',
  'Jan/2016-Mrz/2016 % Ges.-Leistg.': '2.45'},
 {'Bezeichnung': 'Vorläufiges Ergebnis',
  'Mrz/2016': '52096.16',
  'Mrz/2016 % Ges.-Leistg.': '12.62',
  'Jan/2016-Mrz/2016': '19018.37',
  'Jan/2016-Mrz/2016 % Ges.-Leistg.': '1.73'}

Making a API call looks like that:

from openai import OpenAI
from IPython.display import display

api_key = 'sk-xxx xxx xxx.xxx'
organization = 'org-xxx xxxx xxxx'

client = OpenAI(api_key=api_key, organization=organization)

response = client.chat.completions.create(
  model="gpt-3.5-turbo-1106",
  messages = [
    {
        "role": "system",
# You are a tax consultant who analyzes business evaluations (BWA) for your clients.
        "content": "Du bist  Steuerberater, der betriebswirtschaftliche Auswertungen (BWA) für seine Mandanten analysiert."
    },
    {
        "role": "user",
# Here is the BWA data: {json}
        "content": f"Hier sind die BWA-Daten: {json}"
    },
    {
        "role": "user",
# Analyze the data briefly, concisely and understandably for trends, anomalies, opportunities and risks.
        "content": "Analysiere die Daten kurz, prägnant und verständlich nach Trends, Anomalien, Chancen und Risiken."
    }
],
  temperature=0.8,
  max_tokens=1000,
  top_p=1
)
display(response)

display(response.choices[0].finish_reason)
display(response.choices[0].message.content)

The whole 40 page PDF generates a token length of 57229 Token (and as I told you before, the data has already cleaned up )

So my Questions are:

What you think about the extraction, data cleaning and converting to JSON (especially the key: value assignment): Is it all done well so far or would you do it differently?
Now I’m sending the JSON directly to the API which create a lot of tokens, therefore costs. Is the a better way to send this big data to the LLM for Analysis? Maybe as a Assistent? And open up a new assistant for every analysis?

tom_t · January 18, 2024, 11:43am

@malte.huener - are you satisfied with results?
This case looks more for an Assistants API than pure completition. We have to remember that this is a language model, not mathematical one. A workaround is to use agent with code interpreter. It will generate code, run it, and ‘interpret’ results.

malte.huener · January 18, 2024, 12:05pm

Hi @tom_t ,

thanks for your answer!

yes, the results are well so far…

But as I had to use gpt-4-1106-preview because of the tokensize, one analysis requires around 60.000 token , which are $0.6 per analysis… So the costa are a bit to high for a business use case right now…
The second challenge is to pass related tax laws to the LLM on which the analysis should be based…

A workaround is to use agent with code interpreter. It will generate code, run it, and ‘interpret’ results.

So you mean, that I will add my table extraction python code into the assistant , the assistant than take the response from that and then analyze it?

tom_t · January 18, 2024, 1:12pm

Try Assistants API on OopenAI playground to feel what I’m talking about

malte.huener · January 31, 2024, 10:00am

Hey Tom,

this is my code so far…

import json
import time
from openai import OpenAI
#import chardet


api_key = 'sk-XXX'
organization = 'org-XXX'

client = OpenAI(api_key=api_key, organization=organization)

assistant_id = "asst_XXX"

def get_BWA_data():

    bwa_string = "[{'Bezeichnung': 'Umsatzerlöse', 'Mrz/2016': '412575.15', 'Mrz/2016 % Ges.-Leistg.': '99.95', 'Jan/2016-Mrz/2016': '1100164.63', 'Jan/2016-Mrz/2016 % Ges.-Leistg.': '100.04'}, {'Bezeichnung': 'Best.Verdg. FE/UE', 'Mrz/2016': '222.25', 'Mrz/2016 % Ges.-Leistg.': '0.05', 'Jan/2016-Mrz/2016': '-450,97', 'Jan/2016-Mrz/2016 % Ges.-Leistg.': '-0,04'}, {'Bezeichnung': 'Akt.Eigenleistungen', 'Mrz/2016': '0.00', 'Jan/2016-Mrz/2016': '0.00'}, {'Bezeichnung': 'Gesamtleistung', 'Mrz/2016': '412797.40', 'Mrz/2016 % Ges.-Leistg.': '100.00', 'Mrz/2016 % Ges.-Kosten': '255.26', 'Mrz/2016 % Pers.-Kosten': '367.53', 'Jan/2016-Mrz/2016': '1099713.66', 'Jan/2016-Mrz/2016 % Ges.-Leistg.': '100.00', 'Jan/2016-Mrz/2016 % Ges.-Kosten': '203.92', 'Jan/2016-Mrz/2016 % Pers.-Kosten': '302.00'}]"
    return bwa_string

assistant = client.beta.assistants.retrieve(
    assistant_id = assistant_id
)
assistant

thread = client.beta.threads.create()
thread

def start_and_run():
    """
    This function sends a message to the thread and runs it.
    Parameters:
    content (str): The content of the message to send.
    """
    message = client.beta.threads.messages.create(
        thread_id=thread.id,
        role="user",
        content="Dont load BWA data, just see if u can access the retrieval pdf file-Qjbr45RJyXNHRHVRN5jCJtja",
    )
    
    return client.beta.threads.runs.create(
        thread_id=thread.id,
        assistant_id=assistant.id,
    )


    while run.status == "queued" or run.status == "in_progress":
        run = client.beta.threads.runs.retrieve(
            thread_id=thread.id,
            run_id=run.id,
        )
        
        print(run.status)
        time.sleep(0.5)
    return run

# Function to get the response from a thread
def get_response(thread):
    
    response = client.beta.threads.messages.list(thread_id=thread.id, order="asc")
    for message in response:
        print(f"Nachricht von {message.role}: {message.content[0].text.value}")
    return response

run = wait_on_run(start_and_run())

if run.status == "requires_action":

    tool_call = run.required_action.submit_tool_outputs.tool_calls[0]
    name = tool_call.function.name
    arguments = json.loads(tool_call.function.arguments)

    print("Waiting for custom Function:", name)
    print("Function arguments:")
    print(arguments)

    task = get_BWA_data()
    print(task)
    run = client.beta.threads.runs.submit_tool_outputs(
        thread_id=thread.id,
        run_id=run.id,
        tool_outputs=[
            {
                "tool_call_id": tool_call.id,
                "output": task,
            }
        ],
    )

    run = wait_on_run(run)

    print(get_response(thread))

if run.status == "completed":

    print(get_response(thread))

    message = client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content="can you access file-Qjbr45RJyXNHRHVRN5jCJtja?",
)

    # Warten auf die Antwort des Assistenten
    run = wait_on_run(start_and_run())

    # Abrufen und Anzeigen der Antwort
    print(get_response(thread))

I can access data via the get_BWA_data function (now with dummy data, later with data as response from external API call - I already tested successfully big data). now I am facing the problem, that the assistant is unable to retrieve data from upload retrieval pdf

message from assistant:
I’m sorry, but it seems that I’m unable to read the content of the file in the expected format. The file may contain characters or encoding that cannot be decoded properly. If you have any other file or if there’s anything else I can assist you with, please let me know.

retrieval in assistant playground for that assistant:

Whistler.bc.ca · February 12, 2024, 11:02pm

use one of the two models in assistant for reading .pdf’s
gpt-4-turbo-preview
or
gpt-3.5-turbo-0125

they others will not read pdf’s - but I am about to give up and wait for better numerical parsing abilities to be released in the base models, because they just can’t handle complex financial .pdf documents at all right now it seems.

malte.huener · February 16, 2024, 10:26am

So far I wrote pdf table parsing function by myself, calculate the kpi values with another function and then send the kpi for analyzing to LLM.

Let the LLM do the calculating, I never had a good result…

malte.huener · February 16, 2024, 10:29am

Hi @tom_t ,

thanks for your answer !
How do you mean, use the code interpreter instead? Can give me example?

Whistler.bc.ca · February 17, 2024, 9:05pm

I have discovered they is a python module by adobe called “Adobe PDF Extract API”.

I think OpenAI should be incorporating this logic into their product in co-operation with adobe. It’s a tax on us as developers needing to format input ourselves for LLM consumption.

I mean really, if OpenAI can build and release a text to video model … then reading a pdf with tables and numbers into a useable LLM content sould be child’s play for OpenAI

silverfox · February 20, 2024, 12:28pm

I have a similar problem. My use-case involves analysis of large tabulated SEBI PDF reports from businesses and extracting a set of required data points (92 in number), then exporting the LLM’s answer to CSV.
The problem is, I have tried the de facto implementations - LangChain with text splitting with vector storage and analysis (the Pinecone way) and also with simple text parsing but I get API timeouts (error 500) or GPT-4 claims the value doesn’t exist in the PDF when it does. Asking it for 5-6 data points at a time yields correct answers.
Would feeding the API a JSON instead help my case more? and how do I accomplish that?

basarseven · April 25, 2024, 8:13am

Malte Hi, könnten wir uns unterhalten. Ich bin Dozent und würde gerne mit dir über dein problem sprechen und wie du es gelöst hast.
LG

malte.huener · April 29, 2024, 1:13am

Hi @basarseven , können wir gerne machen…
Du erreichst mich unter malte.huener@schlau-digital.de

Topic		Replies	Views
Finetuning model to learn from my product data API	13	2861	February 16, 2023
CSV File How to best parse larger data? API	12	3507	December 19, 2023
Search long pdf for specific table - possibly need fine tuning model API gpt-4 , fine-tuning , api	10	3167	March 29, 2024
Chat completion api tool call loops API api , tools	15	1903	August 6, 2024
Difference of GPT's and Assistants Community custom-gpt , assistants , assistants-api , chatgpt-gpt , tp-1	25	17502	February 6, 2024

Financial analysis, Big Data and PDF Table extraction - Best Practice

Related topics