Building a webapp/startup - what am I missing out on?

Hi everyone!
This is my first post on this forum, but absolutely not my first visit. Please bear with me while I give you some context:

Quick backstory:
Six years ago, I had to end a startup I’d dedicated three years of my life to. One of the reasons was that we needed… wait for it… AI. We had no idea what was about to arrive. So, about five months ago I decided to give it another go and quit my job. This time I’m by myself. Just one tiny challenge from the get-go: I had never written a line of code in my life.

I’ve gone all-in on this project, relying solely on my savings (which, unfortunately, are dwindling quickly). Because of the steep learning curve into the world of code (diving straight into LLMs, Python, databases/SQL), I keep spending time on stuff that turns out to be kind of “out of date”(?).

My goal is to build a web app prototype where one of the core functionalities is to extract structured data from business-to-business documents (PDF, DOCX, Excel, etc.). I’m aiming for something better than a basic MVP, something solid enough to use for pilot tests, presales, and to demonstrate to investors.

My one big obstacle at the moment is nothing new, but I have yet to figure out (or understand?) how to get to the solution: how to reliably convert/extract data from any kind of business document (the typical file types) of “any” size (I would be more than happy with 20 pages of text).

I’m quite confident the solution is already on this forum, but to the inexperienced eye it’s quite hard to find posts that fit my project. I know that I’m missing out on what “true developers” would have done instead of what I’m doing, but I haven’t dared to ask for help before, because I don’t want to be that ‘just another guy who thinks he can build a webapp without any prior experience, coming here for help’. I have spent so much time educating myself. Heck, I haven’t been outdoors for 5 days in a row! I’m so close to the finish line, but after planting my face in my desk I figured it wouldn’t hurt to ask for help. I truly hope for some replies.

For some reason I thought that Chat Completions was something that was going to be replaced by Assistants! (again, steep learning curve)

That’s why I have been spending these past months experimenting and learning how to use Assistants API. After a somewhat disappointing result in my last test run a couple of hours ago, I came across Responses API. When I learned that “Chat Completions is the most used API and will never be discontinued”, it felt like I had to start all over again. I suddenly realized I’ve embarked on a mission thinking I could solve equations while in reality I still need to learn subtraction.

So… here I am, hoping to pick your brain. I’m desperate for some constructive feedback from people who actually know what they’re doing!

This is what I’m trying to achieve:
I’m basically trying to extract every single bit of information found in business-to-business documents that I have identified as useful for my webapp project. Since I’m trying to standardize data without knowing in advance what kind of information the documents contain, I need the AI to match its findings to a predefined list of data fields, hence the use of structured outputs.

What I’m doing and what I’ve tested:
I have set up a PostgreSQL db with about 70 tables and about 300+ columns in total. I’m very satisfied with the results from smaller documents, but I’m struggling once the documents contain 10+ pages. I realized that my instructions and main schema had become too massive/complicated for the AI to handle in one go, so I split everything up into about 10 different assistants with fixed instructions and fixed schemas. I used Pydantic to create the JSON schemas based on my db tables.
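
To give a concrete (simplified) picture of that step, this is roughly what one of my Pydantic models looks like before I turn it into a JSON schema. The field names here are just made-up examples, not my real tables:

```python
from typing import Optional
from pydantic import BaseModel, Field

# Made-up example mirroring one database table; in my project each of the
# ~70 tables gets its own small model like this.
class InvoiceHeader(BaseModel):
    supplier_name: Optional[str] = Field(None, description="Legal name of the supplier")
    invoice_number: Optional[str] = Field(None, description="Invoice number as printed")
    currency: Optional[str] = Field(None, description="ISO 4217 code, e.g. 'EUR'")
    total_amount: Optional[float] = Field(None, description="Grand total incl. VAT")

# Pydantic v2: model_json_schema() produces the JSON Schema that I then pass
# along with the assistant's structured-output settings.
print(InvoiceHeader.model_json_schema())
```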

The past six days I’ve been trying to set up an async pipeline where a document goes in → filled JSON schemas come out → database storage.

  • I’ve been testing about 30 different business documents hundreds of times. Mostly PDFs, but also some DOCX/Excel files. In my experience I get more accurate results by processing/sending documents as images using gpt-4o when the documents contain tables.
  • For documents that are basically just text, I’ve used PyPDF2 to send extracted text instead of the actual document, using o3-mini as the model. I started doing this in an attempt to reduce the number of tokens and the time each run took.
  • After encountering errors doing this with a document that had 14 pages of text, I tried splitting the text into one chunk per page, sending one chunk after another (there’s a small sketch of this right after the list). That fixed the problem of the run resulting in an error (the assistant probably timed out).
  • In another attempt to reduce the token count, I set one assistant up with a “JSON questionnaire” in its instructions to start the pipeline, asking yes/no questions about the content of the text embedded in the user message. Then a script reads the combination of yes/no answers and triggers different assistants (sequentially) on the same thread (where the document text has already been processed by the “questionnaire”), trying to match documents with the right assistants. I figured that was the only way I could give the assistants access to the document contents.
  • I’ve tried creating summaries first and then running the summaries through the pipeline instead, but I have yet to find a way to do so without losing vital data.
  • I also spent two days trying to set up a vector db with RAG, but I honestly had no idea what I was doing or whether it would even help.
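
For reference, the per-page extraction/chunking I mentioned above is roughly this (simplified, with a placeholder file name):

```python
from PyPDF2 import PdfReader

def extract_pages(pdf_path: str) -> list[str]:
    """Return the raw text of each page as its own chunk."""
    reader = PdfReader(pdf_path)
    return [(page.extract_text() or "") for page in reader.pages]

# Placeholder usage: send each page as a separate message instead of the
# whole document at once, which stopped the runs from erroring out on me.
chunks = extract_pages("sample_contract.pdf")
for i, chunk in enumerate(chunks, start=1):
    print(f"--- page {i}: {len(chunk)} characters ---")
```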

My last attempt resulted in 28 messages and 123,000 tokens (92,000 in, 31,000 out).

So I feel like I’m at a dead end…! I’m hoping for some pointers in the right (or at least a better) direction. I would be super grateful for any feedback!

  • Should I start using the Responses API instead?
  • Are there any other Python packages I should use instead/as well?
  • What would you do in my position?
1 Like

To start quickly, I would still use the Assistants API and use Vector Stores. Vector stores create embeddings from the data they are sent, which can afterwards be used for whatever you like. The Responses API does not “save” your embeddings, as it is stateless. But the announcement yesterday mentioned that they will add this to the Responses API later, and will then provide a migration guide from the Assistants API to the Responses API.

1 Like

Thank you for your reply! Thinking back to my initial starting point with OpenAI’s APIs, I believe I was trying to find a way to not upload the documents at all, but I forgot about that when I later realized there’s no point in me worrying about data security at the stage I’m at now. Thank you for reminding me of this. I still don’t understand Vector Stores at all… I’ve tried reading up on it but I just can’t wrap my head around it.

I also realized now that I had misunderstood the meaning of “embeddings” (I had to ChatGPT your reply, lol).

Vector Stores – Think of this as a smart memory for your AI. When you send information to a vector store, it turns the data into embeddings (special mathematical representations of words or documents).

The idea is to only process a document once: Get the data → done, document is no longer needed. Do you recommend I use Vector Stores to first upload the file, then do a run instructing the assistant to use that file, then have a script to delete it…? Please let me know if it seems like I still didn’t get it :stuck_out_tongue:

Sorry but … I had to intervene …

  1. Don’t use Assistants; it will be deprecated mid-2026
  2. Don’t use RAG or Vector Stores if you need the AI to take into account EVERY bit of info in your files
  3. Use either Chat Completions or the Responses API, but don’t attach files in Responses mode, as they will be vectorized and you don’t need that
  4. Preprocess your files by extracting the raw text, if possible as markdown with HTML for the tables (because md tables are too simplistic). For images like screenshots of Excel tables, you’ll need to find another solution
  5. Paste/send/(whatever) the md/html/raw text as the first message in your conversation in chat/response mode
  6. Be sure to write appropriate instructions for your model, activate JSON output, start with low-cost models and evaluate the answers (see the sketch after this list)
  7. Once you’ve found a process that works “manually”, find a programmer/developer and let them automate all this
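
To make steps 5 and 6 concrete, here is a minimal sketch using the official openai Python SDK with its beta Pydantic parse helper. The model name, field names and the document text are placeholders only; adapt them to your own schema and budget:

```python
from typing import Optional
from openai import OpenAI
from pydantic import BaseModel

class ContractFields(BaseModel):  # placeholder target schema, not a recommendation
    customer_name: Optional[str] = None
    contract_start: Optional[str] = None
    contract_value: Optional[float] = None

client = OpenAI()
document_text = "...markdown/HTML you extracted from the document..."

# The beta parse helper validates the model's answer against the Pydantic class,
# so you get structured output back instead of free text.
completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",  # start with a low-cost model and evaluate the answers
    messages=[
        {"role": "system", "content": "Extract the requested fields. Use null when a field is not present."},
        {"role": "user", "content": document_text},
    ],
    response_format=ContractFields,
)
print(completion.choices[0].message.parsed)
```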

Using this process will be costly in terms of tokens, but it ensures that the LLM will try to take into account every single paragraph in your files …

RAG/Vector Stores are useful when you need to ask questions about your files, not good for summarizing and certainly not complete/thorough if it has to process everything in the file.

Hope this helps

3 Likes

Quitting your job to embark on this project with no prior developer experience was a huge mistake. Despite what hype men promise, “anyone can code” and “AI can write your app” are blatant lies meant to scam investors and depress engineering wages. Now you know better.

Your goal must be to establish your MVP ASAP and get some cash flow so that you can get a professional on board. If it doesn’t work 100%, then cut features until it does, and build them back in later.

Assistants API is deprecated and frankly terrible at what it does, but if you are close to release, finish with it and migrate to Responses or something else later. You have a whole year to deal with that.

Your database schema sounds extremely complicated. Are you aware of the JSONB type? It would probably have simplified things for you. Keep it in mind for the future.
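
For reference, a rough sketch of what a JSONB-based layout could look like from Python; the table, connection string and field names are placeholders, not a prescription:

```python
import psycopg2
from psycopg2.extras import Json

conn = psycopg2.connect("dbname=mydb")  # placeholder connection string
with conn, conn.cursor() as cur:
    # One table with a JSONB column can stand in for many narrow tables
    # while the schema is still evolving.
    cur.execute("""
        CREATE TABLE IF NOT EXISTS extracted_documents (
            id SERIAL PRIMARY KEY,
            doc_type TEXT,
            fields JSONB
        )
    """)
    payload = {"supplier_name": "Acme AS", "total_amount": 1234.5}
    cur.execute(
        "INSERT INTO extracted_documents (doc_type, fields) VALUES (%s, %s)",
        ("invoice", Json(payload)),
    )
```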

Many people like to use Pydantic for data-driven apps like the one you describe.

Try to avoid rewriting as much as possible until you have revenue.

1 Like

I am one of those “hype men”. I am currently making a course here. Follow it. We’ll get there (an ETL pipeline with document type recognition and auto workflow generation - like Make does, but without external SaaS) in a few weeks.

1 Like

Thank you! This is why I wrote this post, so I truly appreciate this!

My plan was initially to simply create a demo of the product just to showcase my idea to investors, then get some developers on board. I tried the opposite route before (getting developers to join the project without any money) and learned from that experience.

… but then I tried Cursor and it just blew my mind. It has been so much fun to let my creativity flow without anything (but my experience) holding me back.

Do you have any tips on which packages I should use for the text extraction into markdown/HTML?

Again, thank you!

You can prompt your way to software like this in ~3 hours.

There are other methods for text extraction and you can combine them (especially multiple OCR models when you spatially compare the bounding boxes)…

Follow my course :wink:

I mean, I could build a prototype for document identification and text extraction. But who would pay anything for that these days? Isn’t that business idea completely outdated? I mean, 6 years ago that would have been a 200-million entry bid for investors… now every script kiddie can make that in a day (if determined and with some ideas).

Go to make.com and ask their AI to make you a workflow like you just described… it will automatically do that without any coding needed.

That was done in a minute with a single prompt:

I humbly disagree: I’m giving it a shot instead of regretting that I didn’t.

Yes!

I’ve honestly never heard about or talked to anyone like that. I would never have spent every single day for 5 months on my project if this was my mindset. However, using AI as a tool, a tutor and someone to brainstorm with has been awesome! And I’m really glad I did this :grinning_face_with_smiling_eyes:

I became aware of it about three months ago, but didn’t understand it and decided to continue with what was working, or I’d never reach the MVP. I will keep it in mind for the future! Hopefully an experienced developer can use my database setup to better understand the business logic during the handover process, and then maybe they will convert it to JSONB!

Who said that was my business idea? :stuck_out_tongue:

But how do you get an AI to remember context that’s only found in one sentence on page 1 when it’s trying to find the contextual data on page 15 to accurately fill one of the many data fields in my schema? I experience this a lot in Cursor: the small details are forgotten if I give it too much material at once. Do you have any ideas?

Graph database + embeddings + RDBMS, with an orchestrator that uses tools to grab data from them.

You import it and sort it (which is exactly what my course is about).

It is called context-aware chat.

And man, I am a developer with 34+ years of experience, and it took me multiple years to build that prior to ChatGPT… So how the heck did you think you could do that 9 years ago, lol…

That’s not “taking things too easy” that’s madness!

Trying to find devs to do something for free is on another level of madness.
I’ll go to my Porsche dealer and ask them if they can build me a new type of car because I have a cool idea, and then see how that turns out for me… lol

1 Like

I’ll check it out! Thank you!

… this seems as complicated as, or even more so than, RAG/vectors :sweat_smile:
Is your course about learning how to do this? Is it beginner friendly? :joy:

(6 years ago) We were young, fresh out of college, and were hoping that someone would see the potential and join our team. Not having AI wasn’t “the reason” why we shut down. We were mostly talking about machine learning back then; AI wasn’t even a concept in our minds, hehe. Our plan was to do things manually at first in order to get a cash flow (again, the business plan is not to extract text automatically; it is however a huge value-add). We managed to get some funding and spent every dime (about $25,000) on a great developer who created a live MVP, but things went sideways just before we were about to sign our first customer after presenting our product.

Now I’m just hoping for some simple pointers, like those from @tilleul were pure gold for me.

Well, I will explain the concepts - you don’t have to look for all the building blocks. I did that for years.

If you had known that you would need this and that and that and that and that… before you started, you could have made a better business plan and a better concept.

This is what a relation graph of a piece of software can look like:

Where some files (green) have functions in them (red) – hm wait, I will change the colors for the color blind

The small red dots represent functions, the medium blue ones are folders, and the bigger yellow ones are files.

So we have functions in one file and we use the functions in other files.

I used AST parsing to extract the functions and connected them.
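
A minimal sketch of that idea with Python’s standard ast module (the file path is a placeholder; the actual graph construction is of course more involved):

```python
import ast
from pathlib import Path

def list_functions(py_file: str) -> list[str]:
    """Return the names of all functions defined in a Python source file."""
    tree = ast.parse(Path(py_file).read_text())
    return [
        node.name
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    ]

# Placeholder usage: one node per file, one node per function,
# and an edge from the file to each function it defines.
print(list_functions("some_module.py"))
```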

But you can also use other stuff like Named Entity Extraction or Topic Extraction and many more to get even more relations - and you can weight them depending on the context of the person*1 who is looking for information.

I did this to get the files that are related to a given file, so I can fill a chat and ask the GPT: “this file here has these related files, and we want to implement X… how would we do that?”

And it is a good thing for the GPT model to know that there is a library and which functionality the library has.

In the end it all comes down to finding the right relations for the job. This can be done using lots of data or, like I prefer to do it, by manually adjusting each type of data extraction (for now; that will be done differently soon).

*1 btw: In my system I don’t see “persons”. I am writing AI-first and call someone who wants data an “employee”, which can be an agentic pipeline and/or a human (who, while doing the jobs, trains the agentic pipeline).

1 Like

Awesome! I bet it would have taken you a week or two to build what I’ve been trying to set up for months. Aargh, it’s kind of frustrating to talk to people who are light years ahead :stuck_out_tongue: I’m going to remember this if I ever get a developer on my team. If I wasn’t pressed for money and time, I would definitely check out your course and learn more about this.

1 Like

Let’s get real here…

2 hours vs years

1 Like

SORRY to disappoint you LoWhill, but in my humble experience 99% of those using generative AI have absolutely NO CLUE what they are doing - 99% of our mad Mad MAD Goats-In-White-Coats most assuredly included!

Hence my sage advice is: FORGET all about learning reams of information about how to code, about anything with a steep learning curve and about grunt work generally and concentrate on mastering knowledge that will not very shortly become obsolete. I would also not recommend working on an app that will very shortly become obsolete.

Ergo learn CORE knowledge! PIVOTAL knowledge! Knowledge that will not be obsolete for decades if ever! Think BIG not small! Work on apps that will change the world! As believe you me countless tiny, small and medium sized ideas are going to fall by the wayside in the next few decades!

1 Like

Encouraging words! The only thing I’m disappointed with is responses like this. I was truly hoping that people would share some experience, tips, and tricks.

Who are you to say that my ideas are small and that they won’t change the world?