Financial analysis, Big Data and PDF Table extraction - Best Practice

Hi there…

I am working on a financial anylysis of German so called BWA (business reports made by tax consultants with a tax Software called DATEV) . They are served as table data inside PDFs and have an average Page number of 40 pages.
Here is one example:

I already extracted the table data:, modified the table header (add the dates to the following columns for better understanding for the LLM, cleaned the data frame (deleted NONE Values and empty rows (for smaller in-Token), here is the result) and converted the data into JSON:

[{'Bezeichnung': 'Umsatzerlöse',
  'Mrz/2016': '412575.15',
  'Mrz/2016 % Ges.-Leistg.': '99.95',
  'Jan/2016-Mrz/2016': '1100164.63',
  'Jan/2016-Mrz/2016 % Ges.-Leistg.': '100.04'},
 {'Bezeichnung': 'Best.Verdg. FE/UE',
  'Mrz/2016': '222.25',
  'Mrz/2016 % Ges.-Leistg.': '0.05',
  'Jan/2016-Mrz/2016': '-450,97',
  'Jan/2016-Mrz/2016 % Ges.-Leistg.': '-0,04'},
 {'Bezeichnung': 'Akt.Eigenleistungen',
  'Mrz/2016': '0.00',
  'Jan/2016-Mrz/2016': '0.00'},
 {'Bezeichnung': 'Gesamtleistung',
  'Mrz/2016': '412797.40',
  'Mrz/2016 % Ges.-Leistg.': '100.00',
  'Mrz/2016 % Ges.-Kosten': '255.26',
  'Mrz/2016 % Pers.-Kosten': '367.53',
  'Jan/2016-Mrz/2016': '1099713.66',
  'Jan/2016-Mrz/2016 % Ges.-Leistg.': '100.00',
  'Jan/2016-Mrz/2016 % Ges.-Kosten': '203.92',
  'Jan/2016-Mrz/2016 % Pers.-Kosten': '302.00'},
 {'Bezeichnung': 'Mat./Wareneinkauf',
  'Mrz/2016': '177825.59',
  'Mrz/2016 % Ges.-Leistg.': '43.08',
  'Mrz/2016 % Ges.-Kosten': '109.96',
  'Mrz/2016 % Pers.-Kosten': '158.33',
  'Mrz/2016 Auf-schlag': '100.00',
  'Jan/2016-Mrz/2016': '503196.62',
  'Jan/2016-Mrz/2016 % Ges.-Leistg.': '45.76',
  'Jan/2016-Mrz/2016 % Ges.-Kosten': '93.31',
  'Jan/2016-Mrz/2016 % Pers.-Kosten': '138.19',
  'Jan/2016-Mrz/2016 Auf-schlag': '100.00'},
 {'Bezeichnung': 'Rohertrag',
  'Mrz/2016': '234971.81',
  'Mrz/2016 % Ges.-Leistg.': '56.92',
  'Mrz/2016 % Ges.-Kosten': '145.30',
  'Mrz/2016 % Pers.-Kosten': '209.21',
  'Mrz/2016 Auf-schlag': '132.14',
  'Jan/2016-Mrz/2016': '596517.04',
  'Jan/2016-Mrz/2016 % Ges.-Leistg.': '54.24',
  'Jan/2016-Mrz/2016 % Ges.-Kosten': '110.61',
  'Jan/2016-Mrz/2016 % Pers.-Kosten': '163.81',
  'Jan/2016-Mrz/2016 Auf-schlag': '118.55'},
 {'Bezeichnung': 'So. betr. Erlöse',
  'Mrz/2016': '294.12',
  'Mrz/2016 % Ges.-Leistg.': '0.07',
  'Mrz/2016 % Ges.-Kosten': '0.18',
  'Mrz/2016 % Pers.-Kosten': '0.26',
  'Jan/2016-Mrz/2016': '882.36',
  'Jan/2016-Mrz/2016 % Ges.-Leistg.': '0.08',
  'Jan/2016-Mrz/2016 % Ges.-Kosten': '0.16',
  'Jan/2016-Mrz/2016 % Pers.-Kosten': '0.24'},
 {'Bezeichnung': 'Betriebl. Rohertrag',
  'Mrz/2016': '235265.93',
  'Mrz/2016 % Ges.-Leistg.': '56.99',
  'Mrz/2016 % Ges.-Kosten': '145.48',
  'Mrz/2016 % Pers.-Kosten': '209.47',
  'Mrz/2016 Auf-schlag': '132.30',
  'Jan/2016-Mrz/2016': '597399.40',
  'Jan/2016-Mrz/2016 % Ges.-Leistg.': '54.32',
  'Jan/2016-Mrz/2016 % Ges.-Kosten': '110.78',
  'Jan/2016-Mrz/2016 % Pers.-Kosten': '164.06',
  'Jan/2016-Mrz/2016 Auf-schlag': '118.72'},
 {'Bezeichnung': 'Kostenarten:'},
 {'Bezeichnung': 'Personalkosten',
  'Mrz/2016': '112316.53',
  'Mrz/2016 % Ges.-Leistg.': '27.21',
  'Mrz/2016 % Ges.-Kosten': '69.45',
  'Mrz/2016 % Pers.-Kosten': '100.00',
  'Jan/2016-Mrz/2016': '364141.58',
  'Jan/2016-Mrz/2016 % Ges.-Leistg.': '33.11',
  'Jan/2016-Mrz/2016 % Ges.-Kosten': '67.52',
  'Jan/2016-Mrz/2016 % Pers.-Kosten': '100.00'},
 {'Bezeichnung': 'Raumkosten',
  'Mrz/2016': '15414.89',
  'Mrz/2016 % Ges.-Leistg.': '3.73',
  'Mrz/2016 % Ges.-Kosten': '9.53',
  'Mrz/2016 % Pers.-Kosten': '13.72',
  'Jan/2016-Mrz/2016': '46826.52',
  'Jan/2016-Mrz/2016 % Ges.-Leistg.': '4.26',
  'Jan/2016-Mrz/2016 % Ges.-Kosten': '8.68',
  'Jan/2016-Mrz/2016 % Pers.-Kosten': '12.86'},
 {'Bezeichnung': 'Betriebl. Steuern',
  'Mrz/2016': '568.00',
  'Mrz/2016 % Ges.-Leistg.': '0.14',
  'Mrz/2016 % Ges.-Kosten': '0.35',
  'Mrz/2016 % Pers.-Kosten': '0.51',
  'Jan/2016-Mrz/2016': '2110.00',
  'Jan/2016-Mrz/2016 % Ges.-Leistg.': '0.19',
  'Jan/2016-Mrz/2016 % Ges.-Kosten': '0.39',
  'Jan/2016-Mrz/2016 % Pers.-Kosten': '0.58'},
 {'Bezeichnung': 'Versich./Beiträge',
  'Mrz/2016': '2371.52',
  'Mrz/2016 % Ges.-Leistg.': '0.57',
  'Mrz/2016 % Ges.-Kosten': '1.47',
  'Mrz/2016 % Pers.-Kosten': '2.11',
  'Jan/2016-Mrz/2016': '14418.78',
  'Jan/2016-Mrz/2016 % Ges.-Leistg.': '1.31',
  'Jan/2016-Mrz/2016 % Ges.-Kosten': '2.67',
  'Jan/2016-Mrz/2016 % Pers.-Kosten': '3.96'},
 {'Bezeichnung': 'Besondere Kosten',
  'Mrz/2016': '0.00',
  'Jan/2016-Mrz/2016': '0.00'},
 {'Bezeichnung': 'Kfz-Kosten (o. St.)',
  'Mrz/2016': '7168.00',
  'Mrz/2016 % Ges.-Leistg.': '1.74',
  'Mrz/2016 % Ges.-Kosten': '4.43',
  'Mrz/2016 % Pers.-Kosten': '6.38',
  'Jan/2016-Mrz/2016': '32046.92',
  'Jan/2016-Mrz/2016 % Ges.-Leistg.': '2.91',
  'Jan/2016-Mrz/2016 % Ges.-Kosten': '5.94',
  'Jan/2016-Mrz/2016 % Pers.-Kosten': '8.80'},
 {'Bezeichnung': 'Werbe-/Reisekosten',
  'Mrz/2016': '4497.19',
  'Mrz/2016 % Ges.-Leistg.': '1.09',
  'Mrz/2016 % Ges.-Kosten': '2.78',
  'Mrz/2016 % Pers.-Kosten': '4.00',
  'Jan/2016-Mrz/2016': '11915.39',
  'Jan/2016-Mrz/2016 % Ges.-Leistg.': '1.08',
  'Jan/2016-Mrz/2016 % Ges.-Kosten': '2.21',
  'Jan/2016-Mrz/2016 % Pers.-Kosten': '3.27'},
 {'Bezeichnung': 'Kosten Warenabgabe',
  'Mrz/2016': '1422.70',
  'Mrz/2016 % Ges.-Leistg.': '0.34',
  'Mrz/2016 % Ges.-Kosten': '0.88',
  'Mrz/2016 % Pers.-Kosten': '1.27',
  'Jan/2016-Mrz/2016': '4562.58',
  'Jan/2016-Mrz/2016 % Ges.-Leistg.': '0.41',
  'Jan/2016-Mrz/2016 % Ges.-Kosten': '0.85',
  'Jan/2016-Mrz/2016 % Pers.-Kosten': '1.25'},
 {'Bezeichnung': 'Abschreibungen',
  'Mrz/2016': '6124.62',
  'Mrz/2016 % Ges.-Leistg.': '1.48',
  'Mrz/2016 % Ges.-Kosten': '3.79',
  'Mrz/2016 % Pers.-Kosten': '5.45',
  'Jan/2016-Mrz/2016': '18998.51',
  'Jan/2016-Mrz/2016 % Ges.-Leistg.': '1.73',
  'Jan/2016-Mrz/2016 % Ges.-Kosten': '3.52',
  'Jan/2016-Mrz/2016 % Pers.-Kosten': '5.22'},
 {'Bezeichnung': 'Reparatur/Instandh.',
  'Mrz/2016': '0.00',
  'Jan/2016-Mrz/2016': '7647.13',
  'Jan/2016-Mrz/2016 % Ges.-Leistg.': '0.70',
  'Jan/2016-Mrz/2016 % Ges.-Kosten': '1.42',
  'Jan/2016-Mrz/2016 % Pers.-Kosten': '2.10'},
 {'Bezeichnung': 'Sonstige Kosten',
  'Mrz/2016': '11835.35',
  'Mrz/2016 % Ges.-Leistg.': '2.87',
  'Mrz/2016 % Ges.-Kosten': '7.32',
  'Mrz/2016 % Pers.-Kosten': '10.54',
  'Jan/2016-Mrz/2016': '36612.01',
  'Jan/2016-Mrz/2016 % Ges.-Leistg.': '3.33',
  'Jan/2016-Mrz/2016 % Ges.-Kosten': '6.79',
  'Jan/2016-Mrz/2016 % Pers.-Kosten': '10.05'},
 {'Bezeichnung': 'Gesamtkosten',
  'Mrz/2016': '161718.80',
  'Mrz/2016 % Ges.-Leistg.': '39.18',
  'Mrz/2016 % Ges.-Kosten': '100.00',
  'Mrz/2016 % Pers.-Kosten': '143.98',
  'Jan/2016-Mrz/2016': '539279.42',
  'Jan/2016-Mrz/2016 % Ges.-Leistg.': '49.04',
  'Jan/2016-Mrz/2016 % Ges.-Kosten': '100.00',
  'Jan/2016-Mrz/2016 % Pers.-Kosten': '148.10'},
 {'Bezeichnung': 'Betriebsergebnis',
  'Mrz/2016': '73547.13',
  'Mrz/2016 % Ges.-Leistg.': '17.82',
  'Jan/2016-Mrz/2016': '58119.98',
  'Jan/2016-Mrz/2016 % Ges.-Leistg.': '5.29'},
 {'Bezeichnung': 'Zinsaufwand',
  'Mrz/2016': '7962.48',
  'Mrz/2016 % Ges.-Leistg.': '1.93',
  'Jan/2016-Mrz/2016': '11853.12',
  'Jan/2016-Mrz/2016 % Ges.-Leistg.': '1.08'},
 {'Bezeichnung': 'Sonst. neutr. Aufw',
  'Mrz/2016': '0.00',
  'Jan/2016-Mrz/2016': '400.00',
  'Jan/2016-Mrz/2016 % Ges.-Leistg.': '0.04'},
 {'Bezeichnung': 'Neutraler Aufwand',
  'Mrz/2016': '7962.48',
  'Mrz/2016 % Ges.-Leistg.': '1.93',
  'Jan/2016-Mrz/2016': '12253.12',
  'Jan/2016-Mrz/2016 % Ges.-Leistg.': '1.11'},
 {'Bezeichnung': 'Zinserträge',
  'Mrz/2016': '61.51',
  'Mrz/2016 % Ges.-Leistg.': '0.01',
  'Jan/2016-Mrz/2016': '61.51',
  'Jan/2016-Mrz/2016 % Ges.-Leistg.': '0.01'},
 {'Bezeichnung': 'Sonst. neutr. Ertr',
  'Mrz/2016': '0.00',
  'Jan/2016-Mrz/2016': '0.00'},
 {'Bezeichnung': 'Verr. kalk. Kosten',
  'Mrz/2016': '0.00',
  'Jan/2016-Mrz/2016': '0.00'},
 {'Bezeichnung': 'Neutraler Ertrag',
  'Mrz/2016': '61.51',
  'Mrz/2016 % Ges.-Leistg.': '0.01',
  'Jan/2016-Mrz/2016': '61.51',
  'Jan/2016-Mrz/2016 % Ges.-Leistg.': '0.01'},
 {'Bezeichnung': 'Kontenkl. unbesetzt',
  'Mrz/2016': '0.00',
  'Jan/2016-Mrz/2016': '0.00'},
 {'Bezeichnung': 'Ergebnis vor Steuern',
  'Mrz/2016': '65646.16',
  'Mrz/2016 % Ges.-Leistg.': '15.90',
  'Jan/2016-Mrz/2016': '45928.37',
  'Jan/2016-Mrz/2016 % Ges.-Leistg.': '4.18'},
 {'Bezeichnung': 'Steuern Eink.u.Ertr',
  'Mrz/2016': '13550.00',
  'Mrz/2016 % Ges.-Leistg.': '3.28',
  'Jan/2016-Mrz/2016': '26910.00',
  'Jan/2016-Mrz/2016 % Ges.-Leistg.': '2.45'},
 {'Bezeichnung': 'Vorläufiges Ergebnis',
  'Mrz/2016': '52096.16',
  'Mrz/2016 % Ges.-Leistg.': '12.62',
  'Jan/2016-Mrz/2016': '19018.37',
  'Jan/2016-Mrz/2016 % Ges.-Leistg.': '1.73'}

Making a API call looks like that:

from openai import OpenAI
from IPython.display import display

api_key = 'sk-xxx xxx xxx.xxx'
organization = 'org-xxx xxxx xxxx'

client = OpenAI(api_key=api_key, organization=organization)

response = client.chat.completions.create(
  model="gpt-3.5-turbo-1106",
  messages = [
    {
        "role": "system",
# You are a tax consultant who analyzes business evaluations (BWA) for your clients.
        "content": "Du bist  Steuerberater, der betriebswirtschaftliche Auswertungen (BWA) für seine Mandanten analysiert."
    },
    {
        "role": "user",
# Here is the BWA data: {json}
        "content": f"Hier sind die BWA-Daten: {json}"
    },
    {
        "role": "user",
# Analyze the data briefly, concisely and understandably for trends, anomalies, opportunities and risks.
        "content": "Analysiere die Daten kurz, prägnant und verständlich nach Trends, Anomalien, Chancen und Risiken."
    }
],
  temperature=0.8,
  max_tokens=1000,
  top_p=1
)
display(response)

display(response.choices[0].finish_reason)
display(response.choices[0].message.content)

The whole 40 page PDF generates a token length of 57229 Token (and as I told you before, the data has already cleaned up )

So my Questions are:

  1. What you think about the extraction, data cleaning and converting to JSON (especially the key: value assignment): Is it all done well so far or would you do it differently?
  2. Now I’m sending the JSON directly to the API which create a lot of tokens, therefore costs. Is the a better way to send this big data to the LLM for Analysis? Maybe as a Assistent? And open up a new assistant for every analysis?
1 Like

@malte.huener - are you satisfied with results?
This case looks more for an Assistants API than pure completition. We have to remember that this is a language model, not mathematical one. A workaround is to use agent with code interpreter. It will generate code, run it, and ‘interpret’ results.

Hi @tom_t ,

thanks for your answer!

yes, the results are well so far…

But as I had to use gpt-4-1106-preview because of the tokensize, one analysis requires around 60.000 token , which are $0.6 per analysis… So the costa are a bit to high for a business use case right now…
The second challenge is to pass related tax laws to the LLM on which the analysis should be based…

A workaround is to use agent with code interpreter. It will generate code, run it, and ‘interpret’ results.

So you mean, that I will add my table extraction python code into the assistant , the assistant than take the response from that and then analyze it?

Try Assistants API on OopenAI playground to feel what I’m talking about :slight_smile:

1 Like

Hey Tom,

this is my code so far…

import json
import time
from openai import OpenAI
#import chardet


api_key = 'sk-XXX'
organization = 'org-XXX'

client = OpenAI(api_key=api_key, organization=organization)

assistant_id = "asst_XXX"

def get_BWA_data():

    bwa_string = "[{'Bezeichnung': 'Umsatzerlöse', 'Mrz/2016': '412575.15', 'Mrz/2016 % Ges.-Leistg.': '99.95', 'Jan/2016-Mrz/2016': '1100164.63', 'Jan/2016-Mrz/2016 % Ges.-Leistg.': '100.04'}, {'Bezeichnung': 'Best.Verdg. FE/UE', 'Mrz/2016': '222.25', 'Mrz/2016 % Ges.-Leistg.': '0.05', 'Jan/2016-Mrz/2016': '-450,97', 'Jan/2016-Mrz/2016 % Ges.-Leistg.': '-0,04'}, {'Bezeichnung': 'Akt.Eigenleistungen', 'Mrz/2016': '0.00', 'Jan/2016-Mrz/2016': '0.00'}, {'Bezeichnung': 'Gesamtleistung', 'Mrz/2016': '412797.40', 'Mrz/2016 % Ges.-Leistg.': '100.00', 'Mrz/2016 % Ges.-Kosten': '255.26', 'Mrz/2016 % Pers.-Kosten': '367.53', 'Jan/2016-Mrz/2016': '1099713.66', 'Jan/2016-Mrz/2016 % Ges.-Leistg.': '100.00', 'Jan/2016-Mrz/2016 % Ges.-Kosten': '203.92', 'Jan/2016-Mrz/2016 % Pers.-Kosten': '302.00'}]"
    return bwa_string

assistant = client.beta.assistants.retrieve(
    assistant_id = assistant_id
)
assistant

thread = client.beta.threads.create()
thread

def start_and_run():
    """
    This function sends a message to the thread and runs it.
    Parameters:
    content (str): The content of the message to send.
    """
    message = client.beta.threads.messages.create(
        thread_id=thread.id,
        role="user",
        content="Dont load BWA data, just see if u can access the retrieval pdf file-Qjbr45RJyXNHRHVRN5jCJtja",
    )
    
    return client.beta.threads.runs.create(
        thread_id=thread.id,
        assistant_id=assistant.id,
    )


    while run.status == "queued" or run.status == "in_progress":
        run = client.beta.threads.runs.retrieve(
            thread_id=thread.id,
            run_id=run.id,
        )
        
        print(run.status)
        time.sleep(0.5)
    return run

# Function to get the response from a thread
def get_response(thread):
    
    response = client.beta.threads.messages.list(thread_id=thread.id, order="asc")
    for message in response:
        print(f"Nachricht von {message.role}: {message.content[0].text.value}")
    return response

run = wait_on_run(start_and_run())

if run.status == "requires_action":

    tool_call = run.required_action.submit_tool_outputs.tool_calls[0]
    name = tool_call.function.name
    arguments = json.loads(tool_call.function.arguments)

    print("Waiting for custom Function:", name)
    print("Function arguments:")
    print(arguments)

    task = get_BWA_data()
    print(task)
    run = client.beta.threads.runs.submit_tool_outputs(
        thread_id=thread.id,
        run_id=run.id,
        tool_outputs=[
            {
                "tool_call_id": tool_call.id,
                "output": task,
            }
        ],
    )

    run = wait_on_run(run)

    print(get_response(thread))

if run.status == "completed":

    print(get_response(thread))

    message = client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content="can you access file-Qjbr45RJyXNHRHVRN5jCJtja?",
)

    # Warten auf die Antwort des Assistenten
    run = wait_on_run(start_and_run())

    # Abrufen und Anzeigen der Antwort
    print(get_response(thread))

I can access data via the get_BWA_data function (now with dummy data, later with data as response from external API call - I already tested successfully big data). now I am facing the problem, that the assistant is unable to retrieve data from upload retrieval pdf

message from assistant:
I’m sorry, but it seems that I’m unable to read the content of the file in the expected format. The file may contain characters or encoding that cannot be decoded properly. If you have any other file or if there’s anything else I can assist you with, please let me know.

retrieval in assistant playground for that assistant:

use one of the two models in assistant for reading .pdf’s
gpt-4-turbo-preview
or
gpt-3.5-turbo-0125

they others will not read pdf’s - but I am about to give up and wait for better numerical parsing abilities to be released in the base models, because they just can’t handle complex financial .pdf documents at all right now it seems.

So far I wrote pdf table parsing function by myself, calculate the kpi values with another function and then send the kpi for analyzing to LLM.

Let the LLM do the calculating, I never had a good result…

1 Like

Hi @tom_t ,

thanks for your answer !
How do you mean, use the code interpreter instead? Can give me example?

I have discovered they is a python module by adobe called “Adobe PDF Extract API”.

I think OpenAI should be incorporating this logic into their product in co-operation with adobe. It’s a tax on us as developers needing to format input ourselves for LLM consumption.

I mean really, if OpenAI can build and release a text to video model … then reading a pdf with tables and numbers into a useable LLM content sould be child’s play for OpenAI

1 Like

I have a similar problem. My use-case involves analysis of large tabulated SEBI PDF reports from businesses and extracting a set of required data points (92 in number), then exporting the LLM’s answer to CSV.
The problem is, I have tried the de facto implementations - LangChain with text splitting with vector storage and analysis (the Pinecone way) and also with simple text parsing but I get API timeouts (error 500) or GPT-4 claims the value doesn’t exist in the PDF when it does. Asking it for 5-6 data points at a time yields correct answers.
Would feeding the API a JSON instead help my case more? and how do I accomplish that?