Using davinci for question answering over knowledge documents

Hi, I wanted to validate whether my approach to solving this problem is in line with best practice.

Problem statement: I have many different knowledge documents and articles. I want to use GPT models like davinci or curie/ada/babbage to output an answer based on the content of those documents when a user asks a question. I want the model to answer from the documents only, and I would also like it to output the name of the source document the answer came from.

I have thought of 2 ways to solve it; in the 2nd technique, I do not know how to output the source of the document.

Solution #1 - Feed the content of the documents to the GPT model and prompt it to extract a set of questions and answers from each document. Use that output to create a training set as below. In the completion, also add the source name (in the example below, I added WIKI).

{"prompt": "What information will Form AME95 include? Please also mention the source of information. +++++", "completion": " Form AME95 will include your name and the name of your large employer, the months during the prior calendar year when you were eligible for coverage, and the cost of the cheapest monthly premium you could have paid for coverage under your employer's health plan. If you worked for a large employer that did not offer its full time employees health coverage, Form 1095-C will also indicate"}

Use this set of prompt/completion pairs to fine-tune the model. When the completion is returned, strip the source name (WIKI in this case) programmatically.
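The "strip the source name programmatically" step could look like the sketch below. It assumes the fine-tuned completion ends with a trailing tag like `(source: WIKI)`; that tag format is my own assumption, not something fixed by the post above.

```python
import re

# Assumed tag format: the completion ends with "(source: NAME)".
SOURCE_TAG = re.compile(r"\s*\(source:\s*(?P<name>[^)]+)\)\s*$")

def split_source(completion):
    """Return (answer_text, source_name), stripping the trailing source tag.

    source_name is None when no tag is present.
    """
    match = SOURCE_TAG.search(completion)
    if not match:
        return completion.strip(), None
    return completion[: match.start()].strip(), match.group("name").strip()

answer, source = split_source(
    "Form 1095-C will include your name and employer. (source: WIKI)"
)
```

You would then show `answer` to the user and log or display `source` separately.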

Solution #2 - Use embeddings to identify similar documents that may contain the answer to the question, and add those documents as additional context to the question prompt. This helps ensure the question is answered only from the documents passed as context. Here I do not know how to attach the "source of the answer" as metadata.


Solution #2 is usually the best method to do what you’re trying to do.

You can store the Embeddings vectors in an object that also contains the links to the original source.


{"embeddings": [
    {"vector": <VECTOR array>, "link": <LINK string>},
    ...
]}
If you use a vector database then you can set the link path as a property in the ‘meta’ for each vector.
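A minimal in-memory sketch of this "vector plus source link" idea is below. A real vector database (e.g. Pinecone) does the same thing via a metadata field on each stored vector; the store, links, and toy 2-dimensional vectors here are purely illustrative.

```python
import math

# Illustrative in-memory store; each entry keeps the vector together with
# the link back to the original source document.
store = []

def add(vector, link):
    store.append({"vector": vector, "link": link})

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def query(vector):
    """Return the closest stored entry, including its source link."""
    return max(store, key=lambda e: cosine(e["vector"], vector))

add([1.0, 0.0], "https://example.com/doc-a")
add([0.0, 1.0], "https://example.com/doc-b")
best = query([0.9, 0.1])
```

Because the link travels with the vector, whichever chunk you retrieve already carries its own citation.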


To add, there is lots of documentation on how to do exactly this.


  1. Store the association between the original file → file textblock/para and the generated embedding, e.g. in a dataframe
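Step 1 above could be sketched as follows. A pandas DataFrame works well for this table; a plain list of dicts is used here only to keep the example dependency-free, and `get_embedding()` is a placeholder for whatever embedding call you use (e.g. the OpenAI embeddings API).

```python
def get_embedding(text):
    # Placeholder stand-in for a real embedding call.
    return [float(len(text))]

# Source documents already split into paragraphs/textblocks (assumed input).
documents = {"faq.pdf": ["Para one.", "Second paragraph here."]}

rows = []
for filename, paragraphs in documents.items():
    for i, para in enumerate(paragraphs):
        rows.append({
            "file": filename,        # original source document
            "para_index": i,         # which textblock/para in that file
            "text": para,
            "embedding": get_embedding(para),
        })
```

At query time, the row that matches best gives you both the answer text and the `file` it came from.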

What techniques are usually used to extract tabular details from PDFs to create the embeddings? Some of the open-source components are not good at extracting tables from PDFs.

You can try Table Transformer.

I’ve had some success with CustomGPT using these two approaches.

  1. Embed (within the PDF document) Markdown tables. I know this seems bizarre, but it works to some extent to make table data more findable in chatting contexts.
  2. Build documents in Markdown with Markdown tables (CustomGPT can parse PDFs and MDs and many other formats).

CustomGPT seems to be able to create useful embeddings for table data, but I know they don’t make any performance claims about this. It might be worth a try though - just create a free trial and give it a go.

I have the same use case. I didn’t have satisfactory results using the Solution#1 approach. Ended up going with a Solution#2 approach.

So in the 2nd approach, did you convert the PDF into an MD file and then create the embeddings from it? The scenario I am trying to address is below. If the table looks like

Name             Address
--------        -----------
JOE              NEW YORK

How do I extract this information and format it for GPT so that it knows JOE is from NEW YORK and not CALIFORNIA?

This is really two very different tasks.

  1. Extract this information and format it for GPT - to be clear, you aren't formatting it for GPT; you're doing it for whatever embedding approach you may choose (like CustomGPT) or your own embedding process. If the source is PDF, you need to transform it into text with the table data in Markdown format (note that Markdown tables are not like the one you've shared).

  2. So that it knows JOE is from NEW YORK and not CALIFORNIA - This will require some experimentation. I recommend you create a markdown document with a table of a few dozen examples, get a free trial at CustomGPT, and see how well it performs. All of this can be explored in an hour’s effort, and this will provide some insights on how to further your push to make this work in production.
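The "transform into Markdown tables" step could be sketched like this. It assumes the extracted table uses runs of two or more spaces between columns (as in the example above), which may not hold for every PDF extractor's output; re-emitting rows as sentences is an alternative that keeps each value bound to its column.

```python
import re

# Plain-text table as extracted from the PDF (matches the example above).
raw = """Name             Address
--------        -----------
JOE              NEW YORK"""

# Assumption: columns are separated by 2+ spaces.
lines = [re.split(r"\s{2,}", line.strip()) for line in raw.splitlines()]
header, body = lines[0], lines[2:]  # skip the dashed ruler row

# Option A: a Markdown table.
markdown = ["| " + " | ".join(header) + " |",
            "| " + " | ".join("---" for _ in header) + " |"]
markdown += ["| " + " | ".join(row) + " |" for row in body]

# Option B: one sentence per row, so "JOE" stays bound to "NEW YORK".
sentences = [f"{row[0]}'s {header[1].lower()} is {row[1]}." for row in body]
```

Either form can then be embedded; which one retrieves better is worth testing on your own data.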

Just tried this out using davinci and embedding-ada. I did not need to figure out the tables in the PDF. I just read the PDF as text, converted it to embeddings, and added them to Pinecone. Later I used RAG to retrieve the context and send it to davinci. It answered perfectly.
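For reference, the prompt-assembly step of that RAG flow might look like the sketch below. The retrieved chunks (hard-coded here) would come from the Pinecone query; the instruction wording and the "Source:" labeling are my own assumptions, not a fixed recipe.

```python
# Chunks as they might come back from the vector store, each carrying its
# source document name as metadata (example values only).
retrieved = [
    {"text": "Form 1095-C lists your employer and coverage months.",
     "source": "irs-guide.pdf"},
]

def build_prompt(question, chunks):
    """Assemble a context-restricted prompt that asks for the source."""
    context = "\n\n".join(
        f"Source: {c['source']}\n{c['text']}" for c in chunks
    )
    return (
        "Answer the question using ONLY the context below, and name the "
        "source document you used. If the answer is not in the context, "
        "say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

prompt = build_prompt("What does Form 1095-C include?", retrieved)
```

Because each chunk is labeled with its source in the context, the model can quote the document name back in its answer, which covers the "source of the answer" requirement from the original question.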


What is the solution for the 1st technique? I cannot stop it from answering from outside the content.