Document processing solutions

Has anyone figured out a solution for this yet? I have a PDF as a URL and I’m thinking of using either Node.js or Python to process the PDF and feed it into the GPT API.

The only problem is that I need the output to always have the same structure, while the input can contain multiple pages, different layouts, or even different languages. I’m not sure how to handle this. Any ideas?
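One common way to force a fixed output structure is to send the model a JSON schema in the prompt and validate its reply before accepting it, retrying on failure. A minimal stdlib-only sketch (the schema fields here are invented placeholders, not anything from this thread):

```python
import json

# Hypothetical fixed output schema -- replace these fields with the ones
# your system actually needs. Every order, whatever its source layout or
# language, gets normalised into this one shape.
ORDER_SCHEMA = {
    "customer": str,
    "order_number": str,
    "currency": str,
    "items": list,  # each item: {"sku": str, "quantity": int}
}

def validate_order(raw_json: str) -> dict:
    """Parse the model's reply and check it matches the fixed schema.

    Raises ValueError so the caller can retry the API call, appending
    the error message to the prompt to steer the next attempt.
    """
    data = json.loads(raw_json)
    for key, expected_type in ORDER_SCHEMA.items():
        if key not in data:
            raise ValueError(f"missing field: {key}")
        if not isinstance(data[key], expected_type):
            raise ValueError(f"field {key!r} should be {expected_type.__name__}")
    return data
```

In the prompt you would include the schema (e.g. `json.dumps` of an example object) and an instruction to answer with JSON only; the validation loop is what keeps multi-page, multi-language inputs from leaking structural variation into your output.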

I’ve been working on chunking large data with my code, but I’m having trouble getting it to work smoothly. Right now the output comes out a bit jumbled, not quite the way I want it.
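Jumbled output from chunking is often caused by splitting mid-sentence or mid-paragraph. A minimal sketch of overlap-aware chunking that prefers paragraph boundaries (character counts and overlap size are arbitrary assumptions; a real pipeline would use the PDF’s own page/section structure):

```python
def chunk_text(text: str, max_chars: int = 2000, overlap: int = 200) -> list[str]:
    """Split text into overlapping chunks, cutting at paragraph breaks
    where possible so no chunk starts mid-thought."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        if end < len(text):
            # back up to the last paragraph break inside the window, if any
            cut = text.rfind("\n\n", start, end)
            if cut > start:
                end = cut
        chunks.append(text[start:end])
        if end == len(text):
            break
        # overlap the next chunk so context spanning a boundary is not lost
        start = max(end - overlap, start + 1)
    return chunks
```

The overlap means a fact straddling a chunk boundary still appears whole in at least one chunk, which tends to reduce the garbled results when each chunk is sent to the model independently.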

Real-life example: B2B orders arrive from various customers in PDF format. The goal is to streamline the process by breaking down each order and entering the important details into the system. Using AI to analyze the key parts of an order would save time and avoid the headache of manually entering the same information over and over again. Since each customer uses different formats, headers, and languages, I need a solution to these order-processing woes.

You should use GPT itself to structure the raw text data.

We use Instructor (search for jxnl instructor) to get structured JSON output from PDF documents. It has Python and JS versions. We currently OCR the documents first but have started to get pretty good table extraction results from multimodal LLMs.

At LAWXER we have split the system into three “engines”:

  • comprehension: OCR + raw text sanitisation and formatting + semantic structure analysis
  • analysis: data extraction, logic operations
  • report: data and results output
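The three engines above compose into a single pipeline. A rough sketch of that shape (all function bodies here are invented placeholders, not LAWXER’s actual code):

```python
def comprehend(raw_pdf_text: str) -> list[dict]:
    """Comprehension engine: sanitise raw text and split it into semantic
    elements. Placeholder logic -- a real implementation would run OCR,
    clean the output, and analyse semantic structure."""
    paragraphs = [p.strip() for p in raw_pdf_text.split("\n\n") if p.strip()]
    return [{"id": i, "text": p} for i, p in enumerate(paragraphs)]

def analyse(elements: list[dict], keywords: list[str]) -> dict:
    """Analysis engine: extract the data points the report needs
    (here, a naive keyword search over the elements)."""
    return {
        kw: [e["id"] for e in elements if kw.lower() in e["text"].lower()]
        for kw in keywords
    }

def report(findings: dict) -> str:
    """Report engine: format the extracted data and results for output."""
    lines = [f"{kw}: found in elements {ids}" if ids else f"{kw}: not found"
             for kw, ids in findings.items()]
    return "\n".join(lines)

def process(raw_pdf_text: str, keywords: list[str]) -> str:
    """The full pipeline: comprehension -> analysis -> report."""
    return report(analyse(comprehend(raw_pdf_text), keywords))
```

The value of the split is that each engine can be tested and swapped independently: the comprehension stage can change OCR vendors, or the report stage can change output formats, without touching the analysis logic.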

Real-life usage: legal document analysis in the pre-signature stage.

Your case is technically very similar (except that the output is much simpler).

As a rule of thumb, in the early stages of system design, don’t think about the code; think in terms of how a human does the task.

It took us a year and a half to analyse the way humans do the tasks we needed to automate and to draw the workflow, and about 6 months to go from backlog to working prototype.

I will give it a try. Thanks @hudge

I tackled something similar but ran into trouble with large documents that had to be formatted a certain way. Just a simple bulleted list would do the trick. Our situations are pretty much the same. The great thing is we’ve pinpointed the key factors we need from those documents.

Why did it take so long? Were you trying to figure out exactly what information you needed from those documents? In our situation, we know exactly what output we want to get from any of those PDFs. @sergeliatko

Because people are usually not capable of describing subconscious workflows, especially those with a lot of experience. So we had to start from a blank sheet and draw a “common” text comprehension module first (about 3 months), then a “common” analysis module (another 4 months), then the “common” lawyer’s comprehension workflow and analysis module (another 2 months). And data collection, training, and concept adjustments (another 3 months)… Fortunately, a background in linguistics and psychology helped speed things up.

No, the engine we built “understands” raw text structure and builds data objects based on the source, not the desired outcome. Then we use RAG on the found elements to extract any data we need, or to confirm the absence of the info we are looking for. We can also handle contradictions between elements and style inconsistencies.

In the “comprehension” module we have a “formatter” model that sanitizes the text input and pre-classifies elements. But once the structure object is built, you can do pretty much whatever you want with it or its elements.

I call an “element” a more or less semantically delimited data object with: id, path, title, purpose, text content, metadata, and children (nested sub-elements).
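That description maps naturally onto a recursive data structure. A minimal sketch (the field names come from the post above; the path format and `walk` helper are my assumptions):

```python
from dataclasses import dataclass, field

@dataclass
class Element:
    """A semantically delimited piece of the document, as described above."""
    id: str
    path: str            # e.g. "doc/section-1/clause-2" (hypothetical path format)
    title: str
    purpose: str
    text: str
    meta: dict = field(default_factory=dict)
    children: list["Element"] = field(default_factory=list)

    def walk(self):
        """Yield this element and all nested sub-elements, depth-first."""
        yield self
        for child in self.children:
            yield from child.walk()
```

With a tree like this, the downstream RAG step can index every element’s text independently while keeping its path back to the surrounding context.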