Hi everyone!
This is my first post on this forum, but absolutely not my first visit. Please bear with me while I give you some context:
Quick backstory:
Six years ago, I had to end a startup I’d dedicated three years of my life to. One of the reasons was that we needed… wait for it… AI. We had no idea what was about to arrive. So, about five months ago I decided to give it another go and quit my job. This time I’m by myself. Just one tiny challenge from the get-go: I had never written a line of code in my life.
I’ve gone all-in on this project, relying solely on my savings (which, unfortunately, are dwindling quickly). Because of the steep learning curve into the world of code, diving straight into LLMs, Python, and databases/SQL, I keep finding that I’ve spent time on approaches that are already out of date.
My goal is to build a web app prototype where one of the core functionalities is extracting structured data from business-to-business documents (PDF, DOCX, Excel, etc.). I’m aiming for something better than a basic MVP: something solid enough for pilot tests, presales, and investor demos.
My one big obstacle at the moment isn’t new, but I have yet to figure out (or understand?) how to get to the solution: how to reliably extract data from any kind of business document (the typical file types) of “any” size (I would be more than happy to handle 20 pages of text).
I’m quite confident the solution is already on this forum, but to the inexperienced eye it’s quite hard to find posts that fit my project. I know I’m missing out on what “true developers” would have done instead of what I’m doing, but I haven’t dared to ask for help before, because I don’t want to be that ‘just another guy who thinks he can build a web app without any prior experience, coming here for help’. I have spent so much time educating myself. Heck, I haven’t been outdoors for 5 days in a row! I’m so close to the finish line, but after planting my face on my desk I figured it wouldn’t hurt to ask for help. I truly hope for some replies.
For some reason I thought Chat Completions was going to be replaced by Assistants! (Again, steep learning curve.)
That’s why I’ve spent these past months experimenting and learning how to use the Assistants API. After a somewhat disappointing result in my last test run a couple of hours ago, I came across the Responses API. When I learned that “Chat Completions is the most used API and will never be discontinued”, it felt like I had to start all over again. I suddenly realized I’ve embarked on a mission thinking I could solve equations when in reality I still need to learn subtraction.
So… here I am, hoping to pick your brain. I’m desperate for some constructive feedback from people who actually know what they’re doing!
This is what I’m trying to achieve:
I’m basically trying to extract every piece of information in a business-to-business document that I’ve identified as useful for my web app. Since I’m trying to standardize data without knowing in advance what kind of information the documents contain, I need the AI to match its findings to a predefined list of data fields, hence the use of structured outputs.
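Concretely, something like this is what I mean (a minimal sketch using the Chat Completions structured-outputs helper from the Python SDK rather than my current Assistants setup; the field names like InvoiceFields and supplier_name are just placeholders):

```python
from typing import Optional
from pydantic import BaseModel
from openai import OpenAI

client = OpenAI()

# Placeholder fields: in my project each schema mirrors one of my DB tables.
class InvoiceFields(BaseModel):
    supplier_name: Optional[str]
    invoice_number: Optional[str]
    currency: Optional[str]
    total_amount: Optional[float]

def extract_fields(document_text: str) -> InvoiceFields:
    # Structured outputs: the model must reply with JSON matching the schema,
    # so every finding lands in one of the predefined fields (or stays null).
    completion = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Extract the listed fields. Use null for anything the document does not mention."},
            {"role": "user", "content": document_text},
        ],
        response_format=InvoiceFields,
    )
    return completion.choices[0].message.parsed
```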
What I’m doing and what I’ve tested
I have set up a PostgreSQL database with about 70 tables and 300+ columns in total. I’m very satisfied with the results from smaller documents, but I’m struggling once a document contains 10+ pages. I realized that my instructions and main schema had become too massive/complicated for the model to handle in one go, so I split everything up into about 10 different assistants with fixed instructions and fixed schemas. I used Pydantic to create the JSON schemas from my DB tables.
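To give you an idea of the split (a simplified sketch; the table and field names are placeholders, and in reality there are many more models): one small Pydantic model per DB table, and I generate each assistant’s JSON schema from it.

```python
from typing import Optional
from pydantic import BaseModel

# One small model per DB table (placeholder names, not my real tables).
class PartyInfo(BaseModel):
    company_name: Optional[str]
    vat_number: Optional[str]

class PaymentTerms(BaseModel):
    due_days: Optional[int]
    penalty_interest_pct: Optional[float]

# Each assistant gets exactly one of these fixed schemas instead of the giant one.
SCHEMAS = {
    "party_info": PartyInfo,
    "payment_terms": PaymentTerms,
}

for name, model in SCHEMAS.items():
    schema = model.model_json_schema()  # the JSON schema I hand to the assistant
    print(name, list(schema["properties"]))
```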
For the past six days I’ve been trying to set up an async pipeline where a document goes in → filled JSON schemas come out → database storage.
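In rough shape it looks like this (a stripped-down sketch, assuming the SCHEMAS dict from the previous snippet; error handling and the real Postgres inserts are left out):

```python
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def run_schema(schema_name: str, model_cls, document_text: str):
    # One structured-outputs call per sub-schema (same idea as one call per assistant).
    completion = await client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Fill out the {schema_name} fields. Use null for anything missing."},
            {"role": "user", "content": document_text},
        ],
        response_format=model_cls,
    )
    return schema_name, completion.choices[0].message.parsed

async def store_in_db(schema_name: str, parsed) -> None:
    # Placeholder: this is where the INSERT into the matching Postgres table goes.
    print(schema_name, parsed)

async def process_document(document_text: str, schemas: dict) -> None:
    # Run all sub-schemas concurrently, then store each result.
    tasks = [run_schema(name, cls, document_text) for name, cls in schemas.items()]
    for name, parsed in await asyncio.gather(*tasks):
        await store_in_db(name, parsed)

# asyncio.run(process_document(extracted_text, SCHEMAS))
```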
- I’ve been testing about 30 different business documents, hundreds of runs in total. Mostly PDFs, but also some DOCX/Excel files. In my experience I get more accurate results by sending the pages as images to gpt-4o when the documents contain tables (there’s an image-input sketch right after this list).
- For documents that are basically just text, I’ve used PyPDF2 to extract the text and send that instead of the actual file, with o3-mini as the model. I started doing this to reduce the token count and the time each run took (see the text-extraction sketch after this list).
- After hitting errors with a document that had 14 pages of text, I tried splitting the text into one chunk per page and sending the chunks one after another. That stopped the run from erroring out (the assistant had probably timed out).
- In another attempt to reduce the token count, I set up one assistant with a “JSON questionnaire” in its instructions to start the pipeline, asking yes/no questions about the text embedded in the user message. A script then reads the yes/no combinations and triggers the relevant assistants, sequentially, on the same thread (where the document text has already been processed by the “questionnaire”), trying to match each document with the right assistants. I figured that was the only way I could give the assistants access to the document contents (there’s a routing sketch after this list, too).
- I’ve tried creating summaries first and then running the extraction on the summaries instead, but I have yet to find a way to do so without losing vital data.
- I also spent two days trying to set up a vector DB for RAG, but I honestly had no idea what I was doing or whether it would even help.
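Here is the image-input approach from the first bullet above (a minimal sketch; I’m using pdf2image here, which needs Poppler installed, and the prompt is just an example):

```python
import base64
from io import BytesIO
from pdf2image import convert_from_path  # requires Poppler on the system
from openai import OpenAI

client = OpenAI()

def pdf_page_to_data_url(pdf_path: str, page_number: int) -> str:
    # Render a single PDF page to a PNG and wrap it as a base64 data URL.
    page = convert_from_path(pdf_path, first_page=page_number, last_page=page_number)[0]
    buf = BytesIO()
    page.save(buf, format="PNG")
    b64 = base64.b64encode(buf.getvalue()).decode()
    return f"data:image/png;base64,{b64}"

def extract_from_page_image(pdf_path: str, page_number: int) -> str:
    completion = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Extract the table data from this page."},
                {"type": "image_url", "image_url": {"url": pdf_page_to_data_url(pdf_path, page_number)}},
            ],
        }],
    )
    return completion.choices[0].message.content
```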
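And the text-extraction plus per-page chunking from the second and third bullets (again a simplified sketch; in the real pipeline each chunk goes on to one of the fixed-schema extraction calls):

```python
from PyPDF2 import PdfReader

def pdf_to_page_chunks(pdf_path: str) -> list[str]:
    # One chunk per page, so no single request has to carry all 14 pages at once.
    reader = PdfReader(pdf_path)
    return [page.extract_text() or "" for page in reader.pages]

def process_text_document(pdf_path: str) -> None:
    for page_number, chunk in enumerate(pdf_to_page_chunks(pdf_path), start=1):
        if not chunk.strip():
            continue  # probably a scanned page; those go down the image route instead
        # In my pipeline each chunk is sent on to the extraction call
        # (e.g. the extract_fields() sketch further up).
        print(f"page {page_number}: {len(chunk)} characters")
```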
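The questionnaire/routing step looks roughly like this (a sketch; the yes/no questions and the mapping to sub-schemas are placeholders for my real ones):

```python
from pydantic import BaseModel
from openai import OpenAI

client = OpenAI()

class Questionnaire(BaseModel):
    # Yes/no questions about the document, answered in one structured call.
    mentions_payment_terms: bool
    contains_price_tables: bool
    mentions_contract_parties: bool

def route_document(document_text: str) -> list[str]:
    completion = client.beta.chat.completions.parse(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": "Answer the questions about the document."},
            {"role": "user", "content": document_text},
        ],
        response_format=Questionnaire,
    )
    answers = completion.choices[0].message.parsed
    # Map the yes/no answers to the sub-schemas (or assistants) that should run next.
    selected = []
    if answers.mentions_contract_parties:
        selected.append("party_info")
    if answers.mentions_payment_terms:
        selected.append("payment_terms")
    return selected
```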
My last attempt resulted in 28 messages and 123,000 tokens (92,000 in, 31,000 out).
So I feel like I’m at a dead end…! I’m hoping for some pointers in the right (or at least a better) direction. I would be super grateful for any feedback!
- Should I start using the Responses API instead?
- Are there any other Python packages I should use instead/as well?
- What would you do in my position?