GPT-4-turbo-preview & complex financial statements

I was hoping to develop a model that could read and analyze annual financial reports and extract the statement components to xlsx files. But the numerical extraction ability for complex financial statements in .pdf files does not seem to be really there yet. GPT-4-turbo-preview tries very hard to do it, and I blew through my rate limits today pretty quickly with several different corporate reports.

I asked GPT what OpenAI's development plan is for capability improvements in complex document processing, and here is the answer, which may be of interest to others with aspirations similar to mine:

Improvements in text extraction techniques, especially in the context of processing complex document formats like PDFs, are ongoing and driven by both advancements in technology and growing demands for data accuracy and efficiency in analysis. The landscape of document processing and text extraction technologies evolves rapidly, thanks to continuous research and development in areas such as machine learning, natural language processing (NLP), and document understanding AI.

Expectations for Future Improvements:

1. Short-Term (1-2 Years):

    • Incremental Enhancements: Improvements in existing document parsing libraries to handle a wider variety of PDF layouts and structures more accurately.
    • Better Integration: Enhanced integration capabilities with other AI technologies, such as named entity recognition (NER), to improve the contextual understanding of extracted data.

2. Medium-Term (2-5 Years):

    • Machine Learning Models: Development and adoption of more sophisticated machine learning models that can understand and interpret the context of documents much like a human reader.
    • Semantic Analysis Capabilities: Improved semantic analysis to discern the meaning and significance of text in financial documents, enabling more nuanced data extraction.

3. Long-Term (5+ Years):

    • General AI Integration: Potential integration of general AI advancements, allowing for near-human or superior levels of document comprehension and analysis.
    • Standardization and Automation: Greater standardization in financial reporting and document formats could lead to highly automated and accurate extraction methods tailored to specific industries.

Factors Influencing Improvements:
    • Technological Advances: Continuous research in AI and NLP fields will lead to more powerful algorithms capable of understanding complex documents.
    • Industry Demand: The growing need for fast and accurate data analysis across industries will drive demand for better extraction tools.
    • Regulatory Changes: Changes in reporting standards and requirements may prompt the development of specialized extraction technologies to comply with new norms.

Staying Ahead:
To leverage these future improvements effectively, organizations and individuals should:

    • Stay Informed: Keep abreast of developments in document processing technologies and data analysis tools.
    • Be Adaptable: Be ready to adopt new tools and techniques that can enhance accuracy and efficiency.
    • Invest in Skills: Encourage skill development in areas related to AI, machine learning, and data analytics among team members responsible for financial analysis.

While it’s challenging to predict the exact timing of specific advancements, the trajectory clearly points toward continuous improvement in text extraction capabilities, enabling more accurate and insightful financial document analysis over time.

Conclusion:
So it's not there yet, and it won't be happening for a while with the base models.

My challenge over the past two weeks has been trying to get GPT-3.5 and GPT-4 to extract complete financial numbers and values from statements embedded in a single uploaded PDF. The best I could do was, on one occasion, get 90% accurate results written to an xlsx file.

I have been working in the playground environment.

Bottom line: if you are a GPT, it is very difficult to extract numbers from the columns of a .pdf.
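
To make the problem concrete, here is a minimal sketch of what a statement page looks like once the PDF is flattened to plain text. It assumes the pdfplumber package and a local "annual_report.pdf", neither of which comes from this thread; the point is just that the visual columns the model would need tend to collapse into one stream.

```python
# Minimal sketch: dump a statement page as plain text to see roughly what the
# model receives. pdfplumber and "annual_report.pdf" are illustrative
# assumptions, not tools or files from this thread.
import pdfplumber

with pdfplumber.open("annual_report.pdf") as pdf:
    page = pdf.pages[0]          # e.g. the balance sheet page
    print(page.extract_text())   # column alignment is largely lost in the text stream
```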

But you know that’s just random meaningless AI BS, right?

I’m pretty confident that you can do it with the tools of today, if you refine your process a little!

Have you considered using other OCR or extraction tools before sending the data to the LLM? Sometimes it helps to split tables up by rows, or to turn tables into objects.
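
For what it's worth, here is a rough sketch of that kind of pre-processing, assuming pdfplumber for table extraction and pandas for the xlsx output (both are illustrative choices, not tools mentioned in this thread): each detected table row becomes a small object, and the rows can be written straight to xlsx as well.

```python
# Rough sketch of the suggestion above: extract tables before the LLM sees
# anything, split them by rows, and turn each row into an object.
# pdfplumber and pandas are assumptions for illustration only.
import json
import pdfplumber
import pandas as pd

def tables_as_objects(pdf_path: str) -> list[dict]:
    """Return every detected table row as a header -> value dict."""
    rows = []
    with pdfplumber.open(pdf_path) as pdf:
        for page in pdf.pages:
            for table in page.extract_tables():
                if not table:
                    continue
                header, *body = table
                rows.extend(dict(zip(header, line)) for line in body)
    return rows

rows = tables_as_objects("annual_report.pdf")
pd.DataFrame(rows).to_excel("statements.xlsx", index=False)   # direct xlsx dump
print(json.dumps(rows[:5], indent=2))                         # what the LLM would see
```

Structuring the rows this way also makes it easy to send them to the model in small batches, so nothing gets truncated or misaligned.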


Thanks for the advice, Diet. I've just gotten myself up and running with my Python IDE and talking to the OpenAI API, so I'm going to try your suggestion as soon as possible, along the lines of the sketch below.
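
This is only a first pass, assuming the current openai Python client (v1+), an OPENAI_API_KEY in the environment, and the tables_as_objects() helper sketched above; none of it is a tested pipeline.

```python
# First-pass sketch: send pre-extracted rows (not the raw PDF) to the model.
# Assumes the openai Python client >= 1.0, OPENAI_API_KEY in the environment,
# and the illustrative tables_as_objects() helper sketched earlier.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

rows = tables_as_objects("annual_report.pdf")   # illustrative helper from above

response = client.chat.completions.create(
    model="gpt-4-turbo-preview",
    messages=[
        {"role": "system",
         "content": "You are a financial analyst. Map each row to a standard "
                    "statement line item and return the result as JSON."},
        {"role": "user", "content": json.dumps(rows[:40])},  # keep batches small
    ],
)
print(response.choices[0].message.content)
```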

The random meaningless AI BS is something new for me.

Your input is really appreciated.
