Optimal way to approach this problem

I have to create a web app where a user uploads two documents: one is a standards document containing guidelines (like an ISO standard), and the other is their own document. The AI should compare the two and report which guidelines from the standard our document is missing and which it complies with. The approach we came up with is to split the PDF into individual pages, run a preset prompt on each page, and collect a list of bullet points that we concatenate at the end. We then do the same for our own document and finally compare the two resulting lists of guidelines. However, I’m afraid we may hit API rate limits. What is the best way to approach this?
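Here is a rough sketch of the per-page extraction we have in mind (assuming pypdf for text extraction and the OpenAI Python client; the prompt and model name are just placeholders):

```python
# Sketch of the per-page extraction idea.
# Assumptions: pypdf for text extraction, the OpenAI Python client for the LLM calls,
# and a placeholder prompt/model -- adjust to whatever stack you actually use.
from openai import OpenAI
from pypdf import PdfReader

client = OpenAI()

EXTRACT_PROMPT = (
    "Extract every guideline or requirement on this page as a bullet list. "
    "Ignore headings, introductions, and other non-normative text."
)

def extract_guidelines(pdf_path: str) -> str:
    reader = PdfReader(pdf_path)
    bullets = []
    for page in reader.pages:
        text = page.extract_text() or ""
        if not text.strip():
            continue  # skip blank pages
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model
            messages=[
                {"role": "system", "content": EXTRACT_PROMPT},
                {"role": "user", "content": text},
            ],
        )
        bullets.append(resp.choices[0].message.content)
    return "\n".join(bullets)

standard_points = extract_guidelines("standard.pdf")
own_points = extract_guidelines("our_document.pdf")
# A final call would then compare the two concatenated lists.
```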


Will the guidelines doc be reused? What I mean is, will it be a template that you keep using as a reference? If so, you can put the guideline processing through the Batch API.
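For example, something roughly like this (assuming the OpenAI Python client; file names, model, and prompt are placeholders) would queue one extraction request per page and come back within the 24-hour window at a discount, without touching your live rate limits:

```python
# Sketch: push the per-page prompts through the Batch API instead of live calls.
# Assumes the OpenAI Python client and pypdf; file names and model are placeholders.
import json
from openai import OpenAI
from pypdf import PdfReader

client = OpenAI()
pages = [p.extract_text() or "" for p in PdfReader("guidelines.pdf").pages]

# One JSONL line per page of the guideline document.
with open("guideline_requests.jsonl", "w") as f:
    for i, page_text in enumerate(pages):
        f.write(json.dumps({
            "custom_id": f"page-{i}",
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": "gpt-4o-mini",  # placeholder
                "messages": [
                    {"role": "system", "content": "Extract the guidelines on this page as bullet points."},
                    {"role": "user", "content": page_text},
                ],
            },
        }) + "\n")

batch_file = client.files.create(file=open("guideline_requests.jsonl", "rb"), purpose="batch")
batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)
# Poll client.batches.retrieve(batch.id); when it completes, download the
# output file with client.files.content(...) and concatenate the bullets.
```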

No, the guidelines will be dynamic; each user will have their own set of guidelines. That’s what’s causing the problem: if the guidelines were fixed, I could have used pattern matching to extract them.

What we are thinking is dividing the document into single pages and sending a pre-formatted prompt along with each page’s text to extract the guidelines in a specified format. We do that for the entire PDF and then concatenate the results, and the same for our own PDF. We then compare the results using the LLM. The problem is that we are bound to hit API rate limits, given that we can expect the documents to be 10+ pages long.
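One way we could soften that is to group several pages into a single request and back off when we do hit a rate-limit error, something like this sketch (assuming the OpenAI Python client; the model name, group size, and retry count are placeholders):

```python
# Sketch: fewer, larger requests plus exponential backoff on rate-limit errors.
# Assumes the OpenAI Python client; model and pages-per-request are placeholders to tune.
import time
from openai import OpenAI, RateLimitError

client = OpenAI()
PAGES_PER_REQUEST = 5  # keep each request comfortably under the context limit

def call_with_backoff(messages, retries=5):
    delay = 2.0
    for attempt in range(retries):
        try:
            return client.chat.completions.create(model="gpt-4o-mini", messages=messages)
        except RateLimitError:
            if attempt == retries - 1:
                raise
            time.sleep(delay)
            delay *= 2  # exponential backoff

def extract_in_groups(page_texts: list[str]) -> str:
    bullets = []
    for i in range(0, len(page_texts), PAGES_PER_REQUEST):
        group = "\n\n".join(page_texts[i:i + PAGES_PER_REQUEST])
        resp = call_with_backoff([
            {"role": "system", "content": "Extract every guideline in this text as a bullet list."},
            {"role": "user", "content": group},
        ])
        bullets.append(resp.choices[0].message.content)
    return "\n".join(bullets)
```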

Another solution is to divide both documents into chunks, compute embeddings, and then compare the chunks. The problem here is that the chunks may contain irrelevant content such as descriptions, headings, and introductions (you get the gist). So if we could somehow pre-process the documents to take out the irrelevant parts without hitting the rate limit, we should be good to go?
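For that route, something like the following is what we have in mind: a cheap local pre-filter that drops heading/boilerplate chunks without any API call, then embeddings plus cosine similarity to flag guideline chunks that have no close match in our document (assuming the OpenAI embeddings endpoint; the keyword heuristic and similarity threshold are made up and would need tuning):

```python
# Sketch: heuristic pre-filter (no API cost) + embedding comparison.
# Assumes the OpenAI Python client; the filter rules and threshold are placeholders.
import numpy as np
from openai import OpenAI

client = OpenAI()

def looks_relevant(chunk: str) -> bool:
    # Cheap local filter: drop very short chunks (headings, page numbers)
    # and chunks with no "normative" wording.
    if len(chunk.split()) < 20:
        return False
    keywords = ("shall", "must", "should", "required", "ensure")
    return any(k in chunk.lower() for k in keywords)

def embed(chunks: list[str]) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=chunks)
    return np.array([d.embedding for d in resp.data])

def missing_guidelines(guideline_chunks, our_chunks, threshold=0.75):
    g = [c for c in guideline_chunks if looks_relevant(c)]
    o = [c for c in our_chunks if looks_relevant(c)]
    g_emb, o_emb = embed(g), embed(o)
    # Normalise so the dot product equals cosine similarity.
    g_emb /= np.linalg.norm(g_emb, axis=1, keepdims=True)
    o_emb /= np.linalg.norm(o_emb, axis=1, keepdims=True)
    sims = g_emb @ o_emb.T
    # A guideline with no sufficiently similar chunk in our doc is flagged as missing.
    return [g[i] for i in range(len(g)) if sims[i].max() < threshold]
```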