Aggregated answer across multiple documents (Q&A)

I am using the Q&A feature of GPT-3 to answer questions from a large PDF file.

Since I cannot upload the whole PDF, I convert it to JSON records, with no individual record exceeding 2000 tokens.

While querying for answers, GPT-3 does a semantic search to rank the documents by relevance to the question. But there is a chance I may not find the exact answer because of the way I have split the file into chunks with a maximum size of 2000 tokens.

  1. Is there any way to re-search across all the top-ranked documents and improve the final answer?

  2. Is there any plugin to convert a document (PDF, Word, etc.) to JSON with chunks of 2000 tokens (since I imagine many folks will need to do this to work with GPT-3)?

To illustrate with an example:

Document 1) Alice is a good Tennis player.

Document 2) Alice started playing Hockey this summer.

Document 3) During the Tokyo Olympics she did Shooting.

Question: What sports does Alice play?

Expected Answer: Tennis, Hockey, and Shooting.


Very interesting question. One thing I can think of is to split the document into chunks with overlapping context. For example, if each chunk is 2k tokens, it could overlap its first 1k tokens with the previous chunk and its last 1k tokens with the next chunk. That way there won’t be many unfortunate document splits.
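A minimal sketch of that sliding-window idea in Python (not taken from anyone’s actual code here; whitespace-separated words stand in for tokens, and in practice you would count with GPT-3’s real tokenizer):

```python
def overlapping_chunks(text, chunk_size=2000, overlap=1000):
    """Split `text` into chunks of roughly `chunk_size` tokens,
    each sharing `overlap` tokens with its neighbour."""
    tokens = text.split()  # crude stand-in for real tokenization
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunk = tokens[start:start + chunk_size]
        if not chunk:
            break
        chunks.append(" ".join(chunk))
        if start + chunk_size >= len(tokens):
            break
    return chunks
```

With `chunk_size=2000` and `overlap=1000`, consecutive chunks share 1k tokens on each side, so a fact split across a boundary still appears whole in at least one chunk.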

If the answer needs to combine context from multiple places in the document, I’m not sure there’s a good solution. It may help to split the document into smaller chunks, then populate the context with the top 10 chunks of 100 tokens each, allowing the answer to be based on multiple parts of the document.
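Roughly, the idea would look something like this, where `score_chunk` is just a placeholder for whatever relevance ranking you use (for example the semantic-search scores the API returns), not a real library call:

```python
def build_context(question, small_chunks, score_chunk, top_k=10):
    """Rank the ~100-token chunks against the question and join the best ones."""
    ranked = sorted(small_chunks,
                    key=lambda c: score_chunk(question, c),
                    reverse=True)
    # ~10 x ~100 tokens fits comfortably in one prompt, so the final
    # answer can draw on several parts of the document at once.
    return "\n\n".join(ranked[:top_k])
```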


There is some research going on around this problem, but it’s very much an open topic. I posted some papers here:

Thanks Boris.

Are any of these a possibility?

OPTION 1) Can we generate an answer from each of the top 10 chunks, then combine those 10 answers into a new chunk, and feed that chunk back to GPT-3 so that it generates a better, aggregated answer from the new chunk? (A rough sketch of this is included after the options below.)

OPTION 2) Can we not feed the top 10 chunks into an endpoint that figures out the right answer by looking at all the chunks together? Why do we have this 2k chunk limit, which prevents GPT-3 from giving a more accurate answer?
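To make Option 1 concrete, here is a rough sketch. `complete` is any function you supply that sends a prompt to GPT-3 and returns the completion text; no official endpoint does this aggregation for you, and the prompt wording is only illustrative:

```python
def aggregate_answer(question, top_chunks, complete):
    # Step 1: answer the question from each chunk independently.
    partial_answers = [
        complete(f"Context:\n{chunk}\n\nQuestion: {question}\nAnswer:")
        for chunk in top_chunks
    ]
    # Step 2: combine the partial answers into a new "chunk" and ask again.
    combined = "\n".join(f"- {a.strip()}" for a in partial_answers)
    final_prompt = (
        "Here are partial answers drawn from different parts of a document:\n"
        f"{combined}\n\n"
        f"Question: {question}\n"
        "Give a single aggregated answer:"
    )
    return complete(final_prompt)
```

For the Alice example above, step 1 would yield “Tennis”, “Hockey”, and “Shooting” separately, and step 2 would merge them into one answer.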

  2. Is there any plugin to convert a document (PDF, Word, etc.) to JSON with chunks of 2000 tokens (since I imagine many folks will need to do this to work with GPT-3)?

Check out this Notebook I put together to help divide text data, like a book, into chunks of a certain token size without breaking sentences midway.
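For anyone who just wants the gist, here is a rough sketch of the same idea (this is not the notebook itself), using the tiktoken library to count tokens and a naive regex to find sentence boundaries:

```python
import re
import tiktoken

enc = tiktoken.get_encoding("gpt2")  # GPT-3-style tokenization

def chunk_by_sentences(text, max_tokens=2000):
    """Greedily pack whole sentences into chunks of at most `max_tokens`.
    (A single over-long sentence still becomes its own oversized chunk.)"""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    chunks, current, current_tokens = [], [], 0
    for sentence in sentences:
        n = len(enc.encode(sentence))
        if current and current_tokens + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_tokens = [], 0
        current.append(sentence)
        current_tokens += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```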


We used exactly this approach for our first version of alexsystems.ai:
“sliding” the content, including xxx tokens from both the previous and following chunks.

However, it did not improve the results as drastically as we were expecting. In our case we also match the semantic hits against the document again, highlighting the content used to produce the answer.

In the end we completely switched our approach to parsing PDFs.
Now we use a vision-based approach to identify subject and text blocks, plus things such as tables and graphs, and then send those to other AI models that process them into a structure GPT-3 can digest.

PDFs beneath the surface are a messy place :robot:
My advice to anyone doing something similar is to really figure out how to extract the data correctly first; it’s the make-or-break step.
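For comparison, a much simpler text-layer starting point (this is not the vision pipeline described above, just a pdfplumber sketch) that produces per-page records of text plus tables:

```python
import pdfplumber

def pdf_to_records(path, max_chars=8000):
    """Extract a simple per-page record of text and tables from a PDF."""
    records = []
    with pdfplumber.open(path) as pdf:
        for i, page in enumerate(pdf.pages, start=1):
            text = page.extract_text() or ""          # text layer, may be empty for scans
            tables = page.extract_tables()            # list of tables, each a list of rows
            records.append({
                "page": i,
                "text": text[:max_chars],
                "tables": tables,
            })
    return records
    # Each record can then be token-counted and chunked before the Q&A step.
```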

If the answer needs to combine context from multiple places in the document, I’m not sure there’s a good solution. It may help to split the document into smaller chunks, then populate the context with the top 10 chunks of 100 tokens each, allowing the answer to be based on multiple parts of the document.

We tried this as well, and it actually performed better than the “sliding” method; however, it does not solve more complex tasks such as financial analysis or graph- and table-heavy PDFs.

@boris One idea I had, which I haven’t tried yet but which should theoretically be doable, would be to compress the content by reformulating it and stripping out all the “fluff”, while keeping the compressed version mapped to the full version that lives in some DB (to solve highlighting). I’m not sure the answers engine could handle this by default, but with fine-tuning it should be achievable, and technically the answers process can be broken down into its original steps.

What I wrote above could probably have been written with 25% of the input :joy:
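A rough sketch of that compression idea, with `complete` again standing in for whatever GPT-3 completion call you use, an illustrative prompt, and an in-memory dict standing in for the DB:

```python
def compress_chunks(chunks, complete):
    """Rewrite each chunk tersely and remember the mapping back to the original."""
    compressed_to_full = {}   # compressed text -> original text (for highlighting)
    compressed_chunks = []
    for chunk in chunks:
        prompt = (
            "Rewrite the following text as tersely as possible, keeping every "
            "fact and figure but dropping filler words:\n\n"
            f"{chunk}\n\nTerse version:"
        )
        short = complete(prompt).strip()
        compressed_to_full[short] = chunk
        compressed_chunks.append(short)
    return compressed_chunks, compressed_to_full
```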



Hi, I am following the exact same approach, trying to parse the PDF document into sections.
I am currently using pdfplumber (not OCR based), but I also tried Facebook’s Detectron and thought it could help me do this, albeit in a more involved way.
Could you please point me in the right direction? What Python libraries would you recommend? I am processing financial PDFs with lots of tables, not that many charts.

Thanks