Creating AI Based Document Splitter

Hello Community,

I have encountered a dilemma and would greatly appreciate your insights. I am currently developing a document analyzer and have made good progress, but I am facing a challenge with a specific feature. For the past two weeks, I have been struggling to define the boundaries of individual documents within a combined PDF, which includes contracts, emails, reports, and more. My goal is to create a script that utilizes the OpenAI API to split these documents into sub-documents. However, I am finding it difficult to accurately define each document boundary using the GPT-4o API, especially since I need to pass the pages as images to the API due to various reasons.

My current approach involves passing the pages one by one to the API until it outputs a JSON body with a “result: true,” indicating that I should proceed with splitting the document. At this point, I create a new document containing all the previous pages. While this approach has taken me halfway, I am still encountering inconsistent results. I would greatly appreciate any advice or suggestions you may have on this matter.

P.S. of course I tried with the tempreture=0 and top_p=100 and other various solutions that exist on the internet but didn’t work for me

Thank you in advance for your help!

Hi, I had a similar issue s while ago, solved it finally. See some examples (check the structure.json for the document structure ready to be split into subsections) here: SIMANTIKS API Examples - Business Associate Agreement (fake personal data used in this example). · GitHub

Let me know if that’s what you’re trying to achieve.

Hi Sergeliatko,

Thanks for your response. Sadly this is not what I’m looking for since I’m passing images to the API and I don’t want to make it a text in order to split the original page later.

1 Like

Ok, thanks for letting me know.