Feedback please: Chatbot to answer questions about long documents

Context

A chatbot that answers questions based on hundreds of long documents.
ANY material to read/watch is greatly appreciated, just put me on the right path and I can study :slight_smile:

Current strategy

  1. Downloaded pdf files
  2. Extracted text from pdf files
  3. Text from PDF → Embeddings (“document-embedding”)
  4. User asks the chatbot a question → Embedding (“query-embedding”)
  5. “Query-embedding” compared to “document-embedding” → Get appropriate document
  6. user-question + prompt + document sent to OpenAI.
  7. Happiness achieved
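
Steps 3 to 5 above can be sketched roughly as follows. This is only a minimal illustration: `embed` is a toy bag-of-words stand-in for a real embedding model (an assumption, so the sketch runs offline), and `best_document` and the sample file names are hypothetical.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def best_document(question, documents):
    """Embed the question and return the name of the most similar document."""
    query_vec = embed(question)                                        # step 4
    doc_vecs = {name: embed(text) for name, text in documents.items()}  # step 3
    return max(doc_vecs, key=lambda name: cosine(query_vec, doc_vecs[name]))  # step 5

# Hypothetical sample data standing in for text extracted from PDFs.
documents = {
    "warranty.pdf": "The warranty covers repairs for two years after purchase.",
    "shipping.pdf": "Orders ship within three business days by courier.",
}
print(best_document("How long does the warranty last?", documents))
```

In a real setup the document embeddings would be computed once and stored, not rebuilt on every question as in this sketch.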

My Questions: help please :slight_smile:

Suppose a 500-page document.

Question 1: Should I split the long document? What is the best way to do so?

My current strategy:
500 pages → divide into 8k-token-long subparts → create an 8k-embedding for each → split each 8k subpart into 1k-token-long parts → create 1k-embeddings.

Then the user-query embedding is compared first to the 8k-embeddings to select a “section”, and then to the 1k-embeddings within that section to select a subpart. The user-query text + prompt + 1k-long subpart is sent to OpenAI.

Optionally, I could create the 1k-embeddings only after the user queries, to avoid paying to embed parts of documents that will never be queried.
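
A rough sketch of this two-level lookup, including the lazy option, might look like this. It makes several assumptions: whitespace-separated words stand in for tokens, `embed` is a toy bag-of-words stand-in for a real embedding model, and `two_level_search` is a hypothetical name.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def split_words(words, size):
    """Cut a list of words into consecutive pieces of at most `size` words."""
    return [words[i:i + size] for i in range(0, len(words), size)]

def two_level_search(question, text, section_size=8000, part_size=1000):
    """Pick the best large section first, then the best small part inside it."""
    query = embed(question)
    sections = split_words(text.split(), section_size)
    best_section = max(sections, key=lambda s: cosine(query, embed(" ".join(s))))
    # Lazy variant: small parts are only embedded for the chosen section.
    parts = split_words(best_section, part_size)
    best_part = max(parts, key=lambda p: cosine(query, embed(" ".join(p))))
    return " ".join(best_part)

# Tiny sizes so the behaviour is visible on a 16-word "document".
text = ("cats sleep all day long on warm sofas "
        "dogs bark at night and the owls hoot")
print(two_level_search("when do the owls hoot", text, section_size=8, part_size=4))
```

One thing to watch with this design: if the first-level comparison picks the wrong section, the right 1k part can never be found, so the large embeddings need to be reasonably discriminative.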

Question 2: How can I “concatenate” several 1k-long parts to send to OpenAI? Is this what LangChain does? What is your strategy, and do you have a prompt example? Anything to read/watch is greatly appreciated!

Example:
A 1k-long part is sent to OpenAI for “summarization” and a response is received.
The “response” + “another 1k-long part” is sent again for summarization, and so on.
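
The loop in this example could be sketched as below. The `llm` argument is a hypothetical callable that takes a prompt string and returns the model's reply (in practice it would wrap an API call), and the prompt wording is only an illustration, not a tested prompt.

```python
def rolling_summary(parts, llm):
    """Fold each part into a running summary, one model call per part."""
    summary = ""
    for part in parts:
        if summary:
            # Later calls: previous response + the next part.
            prompt = (
                "Current summary:\n" + summary
                + "\n\nNew text:\n" + part
                + "\n\nUpdate the summary so it also covers the new text."
            )
        else:
            # First call: only the first part.
            prompt = "Summarize the following text:\n" + part
        summary = llm(prompt)
    return summary

def fake_llm(prompt):
    """Placeholder model for a dry run: reports the prompt size instead of summarizing."""
    return "[summary of a %d-character prompt]" % len(prompt)

print(rolling_summary(["first part", "second part", "third part"], fake_llm))
```

Because every call depends on the previous response, the calls cannot run in parallel, and details from early parts may fade as the summary is rewritten.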

Question 3: What is a good prompt for making the model treat the information from the document as the “ground truth”? I plan to use a very low temperature.

Question 4: To process the user-question + prompt + document, would you use untrained gpt-3.5, trained gpt 3.0 curie, or trained gpt 3.0 davinci?

Question 5: Besides manually checking the answers that I am getting, is there a “test” that I can perform to assess answer quality?

Many thanks! Any answer is greatly appreciated! Just point me to the right channels, videos, material to read, etc.

Hey @pmshadow, did you come up with a solution? I’m really interested to know how you did it.

I found a post that does a great job on PDFs. Please google: chat-with-pdfs-using-chatgpt-and-openai-gpt-ap nanonet

Do you have a specific question regarding it? In the end I split the documents into smaller parts, around 300 tokens long with a 100-token overlap.
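
For anyone who wants to try the same split, a minimal sketch of overlapping chunks is below. It assumes the text is already a list of tokens (here plain integers, or whitespace-separated words as a rough stand-in for real tokenizer output), and `chunk_with_overlap` is a hypothetical helper name.

```python
def chunk_with_overlap(tokens, size=300, overlap=100):
    """Split tokens into windows of `size`, each sharing `overlap` tokens with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks

tokens = list(range(700))  # stand-in for 700 tokens of document text
chunks = chunk_with_overlap(tokens)
print([(c[0], c[-1]) for c in chunks])
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of embedding some text twice.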