Context
A chatbot that answers questions based on hundreds of long documents.
ANY material to read/watch is greatly appreciated; just point me in the right direction and I can study.
Current strategy
- Downloaded pdf files
- Extracted text from pdf files
- Text from PDF → Embeddings (“document-embedding”)
- User asks the chatbot a question → Embedding (“query-embedding”)
- “Query-embedding” compared to “document-embedding” → get the most relevant document
- User question + prompt + document sent to OpenAI.
- Happiness achieved
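The retrieval steps above can be sketched in a few lines. This is a toy illustration: the bag-of-words `embed` function below stands in for a real embedding API call, and all the names (`docs`, `retrieve`, etc.) are made up for the example.

```python
# Toy sketch of the embed-and-compare retrieval step.
# Bag-of-words count vectors stand in for real OpenAI embeddings.
import math
from collections import Counter

docs = {
    "doc_cats": "cats are small furry pets that purr when happy",
    "doc_planets": "planets orbit the sun in elliptical paths",
}

# Shared vocabulary so every vector has the same dimensions.
vocab = sorted({w for text in docs.values() for w in text.lower().split()})

def embed(text):
    """Count-vector 'embedding'; a real system would call an embedding API here."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# "document-embeddings", computed once up front
doc_embeddings = {name: embed(text) for name, text in docs.items()}

def retrieve(question):
    """Compare the 'query-embedding' to each document embedding, return the best match."""
    q = embed(question)
    return max(doc_embeddings, key=lambda name: cosine(q, doc_embeddings[name]))
```

The real version only swaps `embed` for an API call and stores the vectors in a vector store instead of a dict; the compare-and-pick-max logic is the same.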
My questions (help please!)
Suppose a 500-page document.
Question 1: Should I split the long document? What is the best way to do so?
My current strategy:
500-page document → divide into 8k-token subparts → create an 8k-embedding for each → split each 8k subpart into 1k-token parts → create 1k-embeddings.
Then the user-query embedding is compared first to the 8k-embeddings to select a “section”, and then to that section’s 1k-embeddings. The user query text + prompt + the best 1k subpart is sent to OpenAI.
Optionally, I could create the 1k-embeddings only after a user query hits their section, to avoid spending twice on processing documents that are never queried.
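The two-level split itself is straightforward. A sketch, using a dummy token list in place of a real tokenizer (a real version would tokenize the text first, e.g. with a library like tiktoken):

```python
# Sketch of the two-level split: 8k-token parents, each cut into 1k-token children.
# The integer list stands in for the real token sequence of a long document.

def chunk(seq, size):
    """Split a sequence into consecutive pieces of at most `size` items."""
    return [seq[i : i + size] for i in range(0, len(seq), size)]

tokens = list(range(20_000))    # pretend this is a 500-page document's tokens
parents = chunk(tokens, 8_000)  # 8k-token "sections" to embed coarsely
tree = [(parent, chunk(parent, 1_000)) for parent in parents]  # 1k-token subparts
```

Each `(parent, children)` pair would get one 8k-embedding and one 1k-embedding per child; retrieval then searches parents first, children second.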
Question 2: How can I “concatenate” several 1k-long parts to send to OpenAI? Is this what LangChain does? What is your strategy, and do you have a prompt example? Anything to read/watch is greatly appreciated!
Example:
A 1k-long part is sent to OpenAI for “summarization” and the response is received.
That response + another 1k-long part is sent again for summarization, and so on.
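This rolling pattern (LangChain calls a similar idea a “refine” chain) can be sketched as a loop. Here `summarize` is any callable that sends a prompt to your LLM and returns text; the toy stand-in below only exists so the sketch runs without an API key, and the prompt wording is just an example.

```python
# Sketch of the iterative summarization loop from the example:
# summarize the first part, then fold each next part into the running summary.

def refine_summary(parts, summarize):
    """`summarize(prompt) -> str` is your LLM call; this only drives the loop."""
    summary = summarize(f"Summarize this text:\n\n{parts[0]}")
    for part in parts[1:]:
        prompt = (
            f"Here is the summary so far:\n{summary}\n\n"
            f"Here is the next part of the document:\n{part}\n\n"
            "Rewrite the summary so it also covers the new part."
        )
        summary = summarize(prompt)
    return summary

# Toy stand-in "LLM" so the loop can be exercised offline.
calls = []
def toy_summarize(prompt):
    calls.append(prompt)
    return f"summary-after-{len(calls)}-calls"

result = refine_summary(["part one", "part two", "part three"], toy_summarize)
```

Note the trade-off: this makes one API call per part and later parts can dilute earlier ones, which is why the alternative “map-reduce” style (summarize each part independently, then summarize the summaries) is also common.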