I am using the Q&A feature of GPT-3 to answer questions from a large PDF file.
Since I cannot upload the whole PDF, I convert it to JSON records, with no individual record exceeding 2000 tokens.
While querying for answers, GPT-3 does a semantic search to rank the documents by relevance to the question. But there is a chance I may not find the exact answer because of the way I have split the file into chunks with a maximum size of 2000 tokens.
Is there any way to search again across all the top-ranked documents and improve the final answer?
Is there any plugin to convert a document (PDF, Word, etc.) to JSON with chunks of 2000 tokens? (I imagine many folks will need this to work with GPT-3.)
To illustrate with an example:
Document 1) Alice is a good Tennis player.
Document 2) Alice started playing Hockey this summer.
Document 3) During the Tokyo Olympics she did Shooting.
Very interesting Q. One thing I can think of is to split the document into chunks with overlapping context. For example, if each chunk is 2k tokens, it could overlap its first 1k tokens with the previous chunk and its last 1k tokens with the next chunk. That way there won’t be many unfortunate document splits.
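A minimal sketch of such an overlapping split, assuming tiktoken’s r50k_base encoding as a stand-in for the GPT-3-era tokenizer (the chunk and overlap sizes here are just illustrative):

```python
import tiktoken

def split_with_overlap(text, chunk_tokens=2000, overlap_tokens=1000):
    """Sliding-window split: each chunk shares `overlap_tokens` tokens
    with the previous one, so a fact is unlikely to be cut in half."""
    enc = tiktoken.get_encoding("r50k_base")  # GPT-3-era tokenizer
    tokens = enc.encode(text)
    step = chunk_tokens - overlap_tokens
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_tokens]
        chunks.append(enc.decode(window))
        if start + chunk_tokens >= len(tokens):
            break
    return chunks
```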
If you need an answer that combines context from multiple places in the document, I’m not sure there’s a good solution. It may help to split the document into smaller chunks and then populate the context with the top 10 chunks of 100 tokens each, so the answer can draw on multiple parts of the document.
OPTION 1) Can we generate an answer from each of the top 10 chunks, then combine those 10 answers into a new chunk and feed that chunk back to GPT-3, so GPT-3 generates a better, aggregated answer from that new chunk? (See the sketch after Option 2.)
OPTION 2) Can we instead feed the top 10 chunks into an endpoint that figures out the right answer by looking at all the chunks together? Why do we have this 2k-token chunk limit, which prevents GPT-3 from giving a more accurate answer?
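Option 1 could be prototyped with plain completions: ask for one answer per chunk, then ask GPT-3 to consolidate the candidates. A rough sketch, assuming the legacy openai Python client and an illustrative model name and prompt wording:

```python
import openai  # legacy (<1.0) client assumed; openai.api_key must be set

def answer_from_chunk(question, chunk, model="text-davinci-003"):
    """Step 1 ('map'): get one candidate answer per top-ranked chunk."""
    prompt = f"Context:\n{chunk}\n\nQuestion: {question}\nAnswer:"
    resp = openai.Completion.create(model=model, prompt=prompt,
                                    max_tokens=150, temperature=0)
    return resp["choices"][0]["text"].strip()

def aggregated_answer(question, top_chunks, model="text-davinci-003"):
    """Step 2 ('reduce'): combine the candidate answers into one."""
    partials = [answer_from_chunk(question, c, model) for c in top_chunks]
    combined = "\n".join(f"- {p}" for p in partials)
    prompt = (f"Candidate answers drawn from different parts of a document:\n"
              f"{combined}\n\nQuestion: {question}\n"
              "Write one consolidated answer:")
    resp = openai.Completion.create(model=model, prompt=prompt,
                                    max_tokens=200, temperature=0)
    return resp["choices"][0]["text"].strip()
```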
Is there any plugin to convert a document (PDF, Word, etc.) to JSON with chunks of 2000 tokens? (I imagine many folks will need this to work with GPT-3.)
Check out this Notebook I put together to help divide text data, like a book, into chunks of a certain token size without breaking sentences midway.
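Roughly the idea (a sketch, not the Notebook itself): split on sentence boundaries first, then pack whole sentences into a chunk until the token budget is reached. Token counts are approximated by whitespace splitting here; a real tokenizer can be swapped in.

```python
import re

def chunk_by_sentences(text, max_tokens=2000):
    """Pack whole sentences into chunks of at most ~max_tokens tokens,
    so no sentence is cut in half. Token counts are approximated by
    whitespace splitting; swap in a real tokenizer for exact limits."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, current_len = [], [], 0
    for sent in sentences:
        n = len(sent.split())
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sent)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```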
We used exactly this approach for our first version of alexsystems.ai:
“Sliding” the content, including xxx tokens from both the previous and following chunks.
However, it did not improve the results as drastically as we expected. In our case we also match the semantic hits against the document again, highlighting the content that was used to produce the answer.
In the end we completely switched our approach to parsing PDFs.
We now use a vision-based approach to identify subject and text blocks, plus things such as tables and graphs, and send those to other AI models that process them into a structure GPT-3 can digest.
Beneath the surface, PDFs are a messy place. My advice to anyone doing something similar is to really figure out how to extract the data correctly first; it’s make it or break it.
If you need an answer that combines context from multiple places in the document, I’m not sure there’s a good solution. It may help to split the document into smaller chunks and then populate the context with the top 10 chunks of 100 tokens each, so the answer can draw on multiple parts of the document.
We tried this as well, and it actually performed better than the “sliding” method. However, it does not solve more complex tasks such as financial analysis or graph- and table-heavy PDFs.
@boris One idea I had, which I haven’t tried yet but which should theoretically be doable, is to compress the content by reformulating it and stripping out all the “fluff”, while keeping the compressed version mapped to the full version that lives in some DB (to solve highlighting). I am not sure the Answers engine could handle this by default; with fine-tuning it should be achievable, and technically the answers process can be broken down into its original steps.
What I wrote above could probably be expressed with 25% of the input.
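A rough sketch of that compress-and-map idea; the summarisation prompt, model name, and in-memory “DB” are all placeholders:

```python
import hashlib
import openai  # legacy (<1.0) client assumed; openai.api_key must be set

original_by_id = {}  # stand-in for a real DB table: chunk_id -> full original text

def compress_chunk(chunk, model="text-davinci-003"):
    """Strip the 'fluff' from a chunk while keeping a pointer back to the
    full version, so the answer can still be highlighted in the original."""
    chunk_id = hashlib.sha1(chunk.encode("utf-8")).hexdigest()[:12]
    original_by_id[chunk_id] = chunk
    prompt = ("Rewrite the passage as tersely as possible, "
              f"keeping every fact:\n\n{chunk}\n\nTerse version:")
    resp = openai.Completion.create(model=model, prompt=prompt,
                                    max_tokens=max(64, len(chunk.split()) // 2),
                                    temperature=0)
    return chunk_id, resp["choices"][0]["text"].strip()

# Answer over the compressed chunks, then look up original_by_id[chunk_id]
# to highlight the exact source passage in the full document.
```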
Hi, I am following the exact same approach, trying to parse a PDF document into sections.
I am currently using pdfplumber (not OCR-based). I also tried Facebook’s Detectron, which I think could help, but in a more complicated way.
Could you please point me in the right direction? What Python libraries would you recommend? I am processing financial PDFs with lots of tables, but not that many charts.
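For reference, pdfplumber itself can already pull both running text and tables page by page; a minimal sketch (the file name is illustrative):

```python
import pdfplumber

# Illustrative: walk a financial PDF page by page, collecting running text
# plus any tables pdfplumber detects on each page.
sections = []
with pdfplumber.open("report.pdf") as pdf:
    for page in pdf.pages:
        text = page.extract_text() or ""
        tables = page.extract_tables()  # each table: list of rows of cells
        sections.append({
            "page": page.page_number,
            "text": text,
            "tables": [[[cell or "" for cell in row] for row in table]
                       for table in tables],
        })
```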