Using davinci for question answering over knowledge documents

Hi, I wanted to validate whether my approach to solving this problem is in line with best practice.

Problem statement: I have many different knowledge documents and articles. I want to use GPT models like davinci or curie/ada/babbage to output an answer based on the content of those documents when a user asks a question. I want the model to answer from the documents only, and I would also like it to output the name of the source document the answer came from.

I have thought of 2 ways to solve it; in the 2nd technique, I do not know how to output the source of the document.

Solution #1 - Feed the content of the documents to the GPT model and prompt it to extract a set of questions and answers from each document. Use that output to create a training set as below. In the completion, also add the source name (in the example below, I added WIKI).

{"prompt": "What information will Form AME95 include? Please also mention the source of information. +++++", "completion": " Form AME95 will include your name and the name of your large employer, the months during the prior calendar year when you were eligible for coverage, and the cost of the cheapest monthly premium you could have paid for coverage under your employer's health plan. If you worked for a large employer that did not offer its full time employees health coverage, Form 1095-C will also indicate"}

Use this set of prompt/completion pairs to fine-tune the model. When the completion is returned, strip the source name (WIKI in this case) programmatically.
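The "strip the source name programmatically" step could look like the sketch below. It assumes the fine-tuned completion ends with a trailing tag like `(source: WIKI)`; that tag format is my own assumption, not something fixed by the post above.

```python
import re

# Assumed tag format: the completion ends with "(source: NAME)".
SOURCE_TAG = re.compile(r"\s*\(source:\s*(?P<name>[^)]+)\)\s*$")

def split_source(completion):
    """Return (answer_text, source_name), stripping the trailing source tag.

    source_name is None when no tag is present.
    """
    match = SOURCE_TAG.search(completion)
    if not match:
        return completion.strip(), None
    return completion[: match.start()].strip(), match.group("name").strip()

answer, source = split_source(
    "Form 1095-C will include your name and employer. (source: WIKI)"
)
```

You would then show `answer` to the user and log or display `source` separately.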

Solution #2 - Use embeddings to identify similar documents that may contain the answer to the question, and add those documents as additional context to the question prompt. This helps ensure the question is answered only from the documents passed as context. Here I do not know how to attach the "source of the answer" as metadata.


Solution #2 is usually the best method to do what you’re trying to do.

You can store the Embeddings vectors in an object that also contains the links to the original source.


{"embeddings": [
    {"vector": <VECTOR array>, "link": <LINK string>},
    ...
]}
If you use a vector database then you can set the link path as a property in the ‘meta’ for each vector.
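A minimal in-memory sketch of this "vector plus source link" idea is below. A real vector database (e.g. Pinecone) does the same thing via a metadata field on each stored vector; the store, links, and toy 2-dimensional vectors here are purely illustrative.

```python
import math

# Illustrative in-memory store; each entry keeps the vector together with
# the link back to the original source document.
store = []

def add(vector, link):
    store.append({"vector": vector, "link": link})

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def query(vector):
    """Return the closest stored entry, including its source link."""
    return max(store, key=lambda e: cosine(e["vector"], vector))

add([1.0, 0.0], "https://example.com/doc-a")
add([0.0, 1.0], "https://example.com/doc-b")
best = query([0.9, 0.1])
```

Because the link travels with the vector, whichever chunk you retrieve already carries its own citation.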


To add, there is lots of documentation on how to do exactly this.


  1. Store the association between the original file → file textblock/para and the generated embedding, e.g. in a dataframe
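Step 1 above could be sketched as follows. A pandas DataFrame works well for this table; a plain list of dicts is used here only to keep the example dependency-free, and `get_embedding()` is a placeholder for whatever embedding call you use (e.g. the OpenAI embeddings API).

```python
def get_embedding(text):
    # Placeholder stand-in for a real embedding call.
    return [float(len(text))]

# Source documents already split into paragraphs/textblocks (assumed input).
documents = {"faq.pdf": ["Para one.", "Second paragraph here."]}

rows = []
for filename, paragraphs in documents.items():
    for i, para in enumerate(paragraphs):
        rows.append({
            "file": filename,        # original source document
            "para_index": i,         # which textblock/para in that file
            "text": para,
            "embedding": get_embedding(para),
        })
```

At query time, the row that matches best gives you both the answer text and the `file` it came from.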

What techniques are usually used to extract tabular details from PDFs to create the embeddings? Some of the open-source components are not good at extracting tables from PDFs.

You can try Table Transformer.

I’ve had some success with CustomGPT using these two approaches.

  1. Embed (within the PDF document) Markdown tables. I know this seems bizarre, but it works to some extent to make table data more findable in chatting contexts.
  2. Build documents in Markdown with Markdown tables (CustomGPT can parse PDFs and MDs and many other formats).

CustomGPT seems to be able to create useful embeddings for table data, but I know they don’t make any performance claims about this. It might be worth a try though - just create a free trial and give it a go.

I have the same use case. I didn’t have satisfactory results using the Solution#1 approach. Ended up going with a Solution#2 approach.

So in the 2nd approach, did you convert the PDF into an MD file and then create the embeddings from it? The scenario I am trying to address is below. If the table looks like

Name             Address
--------        -----------
JOE              NEW YORK

How do I extract this information and format it for GPT so that it knows JOE is from NEW YORK and not CALIFORNIA?

This is really two very different tasks.

  1. Extract this information and format it for GPT - to be clear, you aren't formatting it for GPT; you're doing it for whatever embedding approach you may choose (like CustomGPT) or your own embedding process. If the source is PDF, you need to transform it into text with the table data in Markdown format (note that Markdown tables are not like the one you've shared).

  2. So that it knows JOE is from NEW YORK and not CALIFORNIA - This will require some experimentation. I recommend you create a markdown document with a table of a few dozen examples, get a free trial at CustomGPT, and see how well it performs. All of this can be explored in an hour’s effort, and this will provide some insights on how to further your push to make this work in production.
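The "transform into Markdown tables" step could be sketched like this. It assumes the extracted table uses runs of two or more spaces between columns (as in the example above), which may not hold for every PDF extractor's output; re-emitting rows as sentences is an alternative that keeps each value bound to its column.

```python
import re

# Plain-text table as extracted from the PDF (matches the example above).
raw = """Name             Address
--------        -----------
JOE              NEW YORK"""

# Assumption: columns are separated by 2+ spaces.
lines = [re.split(r"\s{2,}", line.strip()) for line in raw.splitlines()]
header, body = lines[0], lines[2:]  # skip the dashed ruler row

# Option A: a Markdown table.
markdown = ["| " + " | ".join(header) + " |",
            "| " + " | ".join("---" for _ in header) + " |"]
markdown += ["| " + " | ".join(row) + " |" for row in body]

# Option B: one sentence per row, so "JOE" stays bound to "NEW YORK".
sentences = [f"{row[0]}'s {header[1].lower()} is {row[1]}." for row in body]
```

Either form can then be embedded; which one retrieves better is worth testing on your own data.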

Just tried this out using davinci and embedding-ada. I did not need to figure out the tables in the PDF. I just read the PDF as text, converted it to embeddings, and added them to Pinecone. Later I used RAG to retrieve the context and send it to davinci. It answered perfectly.
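For reference, the prompt-assembly step of that RAG flow might look like the sketch below. The retrieved chunks (hard-coded here) would come from the Pinecone query; the instruction wording and the "Source:" labeling are my own assumptions, not a fixed recipe.

```python
# Chunks as they might come back from the vector store, each carrying its
# source document name as metadata (example values only).
retrieved = [
    {"text": "Form 1095-C lists your employer and coverage months.",
     "source": "irs-guide.pdf"},
]

def build_prompt(question, chunks):
    """Assemble a context-restricted prompt that asks for the source."""
    context = "\n\n".join(
        f"Source: {c['source']}\n{c['text']}" for c in chunks
    )
    return (
        "Answer the question using ONLY the context below, and name the "
        "source document you used. If the answer is not in the context, "
        "say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

prompt = build_prompt("What does Form 1095-C include?", retrieved)
```

Because each chunk is labeled with its source in the context, the model can quote the document name back in its answer, which covers the "source of the answer" requirement from the original question.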


What is the solution for the 1st technique? I cannot stop it from answering from outside the content.