Context
A chatbot that answers questions based on hundreds of long documents.
ANY material to read/watch is greatly appreciated; just point me in the right direction and I can study.
Current strategy
- Downloaded pdf files
- Extracted text from pdf files
- Text from PDF → Embeddings (“document-embedding”)
- User asks the chatbot a question → Embedding (“query-embedding”)
- “Query-embedding” compared to “document-embedding” → get the most relevant document
- User question + prompt + document sent to OpenAI.
- Happiness achieved
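The retrieval steps above can be sketched in a few lines. This is a toy illustration: the bag-of-words `embed` function below stands in for a real embedding API call, and all the names (`docs`, `retrieve`, etc.) are made up for the example.

```python
# Toy sketch of the embed-and-compare retrieval step.
# Bag-of-words count vectors stand in for real OpenAI embeddings.
import math
from collections import Counter

docs = {
    "doc_cats": "cats are small furry pets that purr when happy",
    "doc_planets": "planets orbit the sun in elliptical paths",
}

# Shared vocabulary so every vector has the same dimensions.
vocab = sorted({w for text in docs.values() for w in text.lower().split()})

def embed(text):
    """Count-vector 'embedding'; a real system would call an embedding API here."""
    counts = Counter(text.lower().split())
    return [counts[w] for w in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

# "document-embeddings", computed once up front
doc_embeddings = {name: embed(text) for name, text in docs.items()}

def retrieve(question):
    """Compare the 'query-embedding' to each document embedding, return the best match."""
    q = embed(question)
    return max(doc_embeddings, key=lambda name: cosine(q, doc_embeddings[name]))
```

The real version only swaps `embed` for an API call and stores the vectors in a vector store instead of a dict; the compare-and-pick-max logic is the same.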
My questions (help please!)
Suppose a 500-page document.
Question 1: Should I split the long document? What is the best way to do so?
My current strategy:
500-page document → divide into 8k-token subparts → create an 8k-embedding for each → split each 8k subpart into 1k-token parts → create 1k-embeddings.
Then the user-query embedding is compared first to the 8k-embeddings to select a “section”, and then to that section’s 1k-embeddings. The user query text + prompt + the best 1k subpart is sent to OpenAI.
Optionally, I could create the 1k-embeddings only after a user query hits their section, to avoid spending twice on processing documents that are never queried.
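The two-level split itself is straightforward. A sketch, using a dummy token list in place of a real tokenizer (a real version would tokenize the text first, e.g. with a library like tiktoken):

```python
# Sketch of the two-level split: 8k-token parents, each cut into 1k-token children.
# The integer list stands in for the real token sequence of a long document.

def chunk(seq, size):
    """Split a sequence into consecutive pieces of at most `size` items."""
    return [seq[i : i + size] for i in range(0, len(seq), size)]

tokens = list(range(20_000))    # pretend this is a 500-page document's tokens
parents = chunk(tokens, 8_000)  # 8k-token "sections" to embed coarsely
tree = [(parent, chunk(parent, 1_000)) for parent in parents]  # 1k-token subparts
```

Each `(parent, children)` pair would get one 8k-embedding and one 1k-embedding per child; retrieval then searches parents first, children second.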
Question 2: How can I “concatenate” several 1k-long parts to send to OpenAI? Is this what LangChain does? What is your strategy, and do you have a prompt example? Anything to read/watch is greatly appreciated!
Example:
A 1k-long part is sent to OpenAI for “summarization” and the response is received.
That response + another 1k-long part is sent again for summarization, and so on.
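This rolling pattern (LangChain calls a similar idea a “refine” chain) can be sketched as a loop. Here `summarize` is any callable that sends a prompt to your LLM and returns text; the toy stand-in below only exists so the sketch runs without an API key, and the prompt wording is just an example.

```python
# Sketch of the iterative summarization loop from the example:
# summarize the first part, then fold each next part into the running summary.

def refine_summary(parts, summarize):
    """`summarize(prompt) -> str` is your LLM call; this only drives the loop."""
    summary = summarize(f"Summarize this text:\n\n{parts[0]}")
    for part in parts[1:]:
        prompt = (
            f"Here is the summary so far:\n{summary}\n\n"
            f"Here is the next part of the document:\n{part}\n\n"
            "Rewrite the summary so it also covers the new part."
        )
        summary = summarize(prompt)
    return summary

# Toy stand-in "LLM" so the loop can be exercised offline.
calls = []
def toy_summarize(prompt):
    calls.append(prompt)
    return f"summary-after-{len(calls)}-calls"

result = refine_summary(["part one", "part two", "part three"], toy_summarize)
```

Note the trade-off: this makes one API call per part and later parts can dilute earlier ones, which is why the alternative “map-reduce” style (summarize each part independently, then summarize the summaries) is also common.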