Optimal way to approach this problem

I have to create a web app where a user uploads two documents: one is a standards document containing guidelines (like an ISO standard), and the other is their own document. The AI should compare the two and report which guidelines from the standard our document is missing and which it complies with. The approach we came up with is to split the PDF into individual pages, run a preset prompt on each page, and collect a list of bullet points that we concatenate at the end. We then do the same for our own document and finally compare the two resulting lists of guidelines. However, I’m afraid we may hit API rate limits. What is the best way to approach this?
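Here is a rough sketch of the per-page extraction we have in mind (assuming pypdf for text extraction and the OpenAI Python client; the prompt and model name are just placeholders):

```python
# Sketch of the per-page extraction idea.
# Assumptions: pypdf for text extraction, the OpenAI Python client for the LLM calls,
# and a placeholder prompt/model -- adjust to whatever stack you actually use.
from openai import OpenAI
from pypdf import PdfReader

client = OpenAI()

EXTRACT_PROMPT = (
    "Extract every guideline or requirement on this page as a bullet list. "
    "Ignore headings, introductions, and other non-normative text."
)

def extract_guidelines(pdf_path: str) -> str:
    reader = PdfReader(pdf_path)
    bullets = []
    for page in reader.pages:
        text = page.extract_text() or ""
        if not text.strip():
            continue  # skip blank pages
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model
            messages=[
                {"role": "system", "content": EXTRACT_PROMPT},
                {"role": "user", "content": text},
            ],
        )
        bullets.append(resp.choices[0].message.content)
    return "\n".join(bullets)

standard_points = extract_guidelines("standard.pdf")
own_points = extract_guidelines("our_document.pdf")
# A final call would then compare the two concatenated lists.
```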


Will the guidelines doc be reused? What I mean is, will it be a template that you keep using as a reference? If so, you can put the guideline processing through the Batch API.
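For example, something roughly like this (assuming the OpenAI Python client; file names, model, and prompt are placeholders) would queue one extraction request per page and come back within the 24-hour window at a discount, without touching your live rate limits:

```python
# Sketch: push the per-page prompts through the Batch API instead of live calls.
# Assumes the OpenAI Python client and pypdf; file names and model are placeholders.
import json
from openai import OpenAI
from pypdf import PdfReader

client = OpenAI()
pages = [p.extract_text() or "" for p in PdfReader("guidelines.pdf").pages]

# One JSONL line per page of the guideline document.
with open("guideline_requests.jsonl", "w") as f:
    for i, page_text in enumerate(pages):
        f.write(json.dumps({
            "custom_id": f"page-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o-mini",  # placeholder
                "messages": [
                    {"role": "system", "content": "Extract the guidelines on this page as bullet points."},
                    {"role": "user", "content": page_text},
                ],
            },
        }) + "\n")

batch_file = client.files.create(file=open("guideline_requests.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
# Poll client.batches.retrieve(batch.id); when it completes, download the
# output file with client.files.content(...) and concatenate the bullets.
```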

No, the guidelines will be dynamic; each user will have their own set of guidelines. That’s what’s causing the problem: if the guidelines were fixed, I could have used pattern matching to extract them.

What we are thinking is dividing the document into single pages and sending a pre-formatted prompt along with each page’s text to extract the guidelines in a specified format. We do that for the entire PDF and then concatenate the results, and the same for our own PDF. We then compare the results using the LLM. The problem is that we are bound to hit API rate limits, given that we can expect the documents to be 10+ pages long.
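One way we could soften that is to group several pages into a single request and back off when we do hit a rate-limit error, something like this sketch (assuming the OpenAI Python client; the model name, group size, and retry count are placeholders):

```python
# Sketch: fewer, larger requests plus exponential backoff on rate-limit errors.
# Assumes the OpenAI Python client; model and pages-per-request are placeholders to tune.
import time
from openai import OpenAI, RateLimitError

client = OpenAI()
PAGES_PER_REQUEST = 5  # keep each request comfortably under the context limit

def call_with_backoff(messages, retries=5):
    delay = 2.0
    for attempt in range(retries):
        try:
            return client.chat.completions.create(model="gpt-4o-mini", messages=messages)
        except RateLimitError:
            if attempt == retries - 1:
                raise
            time.sleep(delay)
            delay *= 2  # exponential backoff

def extract_in_groups(page_texts: list[str]) -> str:
    bullets = []
    for i in range(0, len(page_texts), PAGES_PER_REQUEST):
        group = "\n\n".join(page_texts[i:i + PAGES_PER_REQUEST])
        resp = call_with_backoff([
            {"role": "system", "content": "Extract every guideline in this text as a bullet list."},
            {"role": "user", "content": group},
        ])
        bullets.append(resp.choices[0].message.content)
    return "\n".join(bullets)
```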

Another solution is to divide both documents into chunks, compute embeddings, and then compare the chunks. The problem here is that the chunks may contain irrelevant content such as descriptions, headings, and introductions (you get the gist). So if we could somehow pre-process the documents to take out the irrelevant parts without hitting the rate limit, we should be good to go?
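For that route, something like the following is what we have in mind: a cheap local pre-filter that drops heading/boilerplate chunks without any API call, then embeddings plus cosine similarity to flag guideline chunks that have no close match in our document (assuming the OpenAI embeddings endpoint; the keyword heuristic and similarity threshold are made up and would need tuning):

```python
# Sketch: heuristic pre-filter (no API cost) + embedding comparison.
# Assumes the OpenAI Python client; the filter rules and threshold are placeholders.
import numpy as np
from openai import OpenAI

client = OpenAI()

def looks_relevant(chunk: str) -> bool:
    # Cheap local filter: drop very short chunks (headings, page numbers)
    # and chunks with no "normative" wording.
    if len(chunk.split()) < 20:
        return False
    keywords = ("shall", "must", "should", "required", "ensure")
    return any(k in chunk.lower() for k in keywords)

def embed(chunks: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=chunks)
    return np.array([d.embedding for d in resp.data])

def missing_guidelines(guideline_chunks, our_chunks, threshold=0.75):
    g = [c for c in guideline_chunks if looks_relevant(c)]
    o = [c for c in our_chunks if looks_relevant(c)]
    g_emb, o_emb = embed(g), embed(o)
    # Normalise so the dot product equals cosine similarity.
    g_emb /= np.linalg.norm(g_emb, axis=1, keepdims=True)
    o_emb /= np.linalg.norm(o_emb, axis=1, keepdims=True)
    sims = g_emb @ o_emb.T
    # A guideline with no sufficiently similar chunk in our doc is flagged as missing.
    return [g[i] for i in range(len(g)) if sims[i].max() < threshold]
```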