Using davinci for question answering over knowledge documents

Hi, I wanted to validate whether my approach to solving this problem is in line with best practice.

Problem statement: I have many different knowledge documents and articles. I want to use GPT models like davinci or curie/ada/babbage to output an answer based on the content of those documents when a user asks a question. I want the model to answer from the documents only, and I would also like it to output the name of the source document the answer came from.

I have thought of 2 ways to solve it; in the 2nd technique, I do not know how to output the source of the document.

Solution #1 - Feed the content of the documents to the GPT model and prompt it to extract a set of questions and answers from each document. Use that output to create a training set as below. In the completion, also add the source name (in the example below, I added WIKI).

{"prompt": "What information will Form AME95 include? Please also mention the source of information. +++++", "completion": " Form AME95 will include your name and the name of your large employer, the months during the prior calendar year when you were eligible for coverage, and the cost of the cheapest monthly premium you could have paid for coverage under your employer's health plan. If you worked for a large employer that did not offer its full time employees health coverage, Form 1095-C will also indicate"}

Use this set of prompt/completion pairs to fine-tune the model. When the completion is returned, strip the source name (WIKI in this case) programmatically.
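The "strip the source name programmatically" step could look like the sketch below. It assumes the fine-tuned completion ends with a trailing tag like `(source: WIKI)`; that tag format is my own assumption, not something fixed by the post above.

```python
import re

# Assumed tag format: the completion ends with "(source: NAME)".
SOURCE_TAG = re.compile(r"\s*\(source:\s*(?P<name>[^)]+)\)\s*$")

def split_source(completion):
    """Return (answer_text, source_name), stripping the trailing source tag.

    source_name is None when no tag is present.
    """
    match = SOURCE_TAG.search(completion)
    if not match:
        return completion.strip(), None
    return completion[: match.start()].strip(), match.group("name").strip()

answer, source = split_source(
    "Form 1095-C will include your name and employer. (source: WIKI)"
)
```

You would then show `answer` to the user and log or display `source` separately.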

Solution #2 - Use embeddings to identify similar documents that may contain the answer to the question, and add those documents as additional context to the question prompt. This helps ensure the question is answered only from the documents passed as context. Here I do not know how to attach the "source of the answer" as metadata.


Solution #2 is usually the best method to do what you’re trying to do.

You can store the Embeddings vectors in an object that also contains the links to the original source.


{"embeddings": [
    {"vector": <VECTOR array>, "link": <LINK string>},
    ...
]}
If you use a vector database then you can set the link path as a property in the ‘meta’ for each vector.
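A minimal in-memory sketch of this "vector plus source link" idea is below. A real vector database (e.g. Pinecone) does the same thing via a metadata field on each stored vector; the store, links, and toy 2-dimensional vectors here are purely illustrative.

```python
import math

# Illustrative in-memory store; each entry keeps the vector together with
# the link back to the original source document.
store = []

def add(vector, link):
    store.append({"vector": vector, "link": link})

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def query(vector):
    """Return the closest stored entry, including its source link."""
    return max(store, key=lambda e: cosine(e["vector"], vector))

add([1.0, 0.0], "https://example.com/doc-a")
add([0.0, 1.0], "https://example.com/doc-b")
best = query([0.9, 0.1])
```

Because the link travels with the vector, whichever chunk you retrieve already carries its own citation.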


To add, there is lots of documentation on how to do exactly this.


  1. Store the association between the original file → file textblock/para and the generated embedding, e.g. in a dataframe
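Step 1 above could be sketched as follows. A pandas DataFrame works well for this table; a plain list of dicts is used here only to keep the example dependency-free, and `get_embedding()` is a placeholder for whatever embedding call you use (e.g. the OpenAI embeddings API).

```python
def get_embedding(text):
    # Placeholder stand-in for a real embedding call.
    return [float(len(text))]

# Source documents already split into paragraphs/textblocks (assumed input).
documents = {"faq.pdf": ["Para one.", "Second paragraph here."]}

rows = []
for filename, paragraphs in documents.items():
    for i, para in enumerate(paragraphs):
        rows.append({
            "file": filename,        # original source document
            "para_index": i,         # which textblock/para in that file
            "text": para,
            "embedding": get_embedding(para),
        })
```

At query time, the row that matches best gives you both the answer text and the `file` it came from.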

What techniques are usually used to extract tabular details from PDFs to create the embeddings? Some of the open-source components are not good at extracting tables from PDFs.

You can try Table Transformer.

I’ve had some success with CustomGPT using these two approaches.

  1. Embed (within the PDF document) Markdown tables. I know this seems bizarre, but it works to some extent to make table data more findable in chatting contexts.
  2. Build documents in Markdown with Markdown tables (CustomGPT can parse PDFs and MDs and many other formats).

CustomGPT seems to be able to create useful embeddings for table data, but I know they don’t make any performance claims about this. It might be worth a try though - just create a free trial and give it a go.

I have the same use case. I didn’t have satisfactory results using the Solution#1 approach. Ended up going with a Solution#2 approach.

So in the 2nd approach, did you convert the PDF into an MD file and then create the embeddings from it? The scenario I am trying to address is below. If the table looks like

Name             Address
--------        -----------
JOE              NEW YORK

How do I extract this information and format it for GPT so that it knows JOE is from NEW YORK and not CALIFORNIA?

This is really two very different tasks.

  1. Extract this information and format it for GPT - to be clear, you aren't formatting it for GPT; you're doing it for whatever embedding approach you may choose (like CustomGPT) or your own embedding process. If the source is PDF, you need to transform it into text with the table data in Markdown format (note that Markdown tables are not like the one you've shared).

  2. So that it knows JOE is from NEW YORK and not CALIFORNIA - This will require some experimentation. I recommend you create a markdown document with a table of a few dozen examples, get a free trial at CustomGPT, and see how well it performs. All of this can be explored in an hour’s effort, and this will provide some insights on how to further your push to make this work in production.
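The "transform into Markdown tables" step could be sketched like this. It assumes the extracted table uses runs of two or more spaces between columns (as in the example above), which may not hold for every PDF extractor's output; re-emitting rows as sentences is an alternative that keeps each value bound to its column.

```python
import re

# Plain-text table as extracted from the PDF (matches the example above).
raw = """Name             Address
--------        -----------
JOE              NEW YORK"""

# Assumption: columns are separated by 2+ spaces.
lines = [re.split(r"\s{2,}", line.strip()) for line in raw.splitlines()]
header, body = lines[0], lines[2:]  # skip the dashed ruler row

# Option A: a Markdown table.
markdown = ["| " + " | ".join(header) + " |",
            "| " + " | ".join("---" for _ in header) + " |"]
markdown += ["| " + " | ".join(row) + " |" for row in body]

# Option B: one sentence per row, so "JOE" stays bound to "NEW YORK".
sentences = [f"{row[0]}'s {header[1].lower()} is {row[1]}." for row in body]
```

Either form can then be embedded; which one retrieves better is worth testing on your own data.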

Just tried this out using davinci and embedding-ada. I did not need to figure out the tables in the PDF. I just read the PDF as text, converted it to embeddings, and added them to Pinecone. Later I used RAG to retrieve the context and send it to davinci. It answered perfectly.
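For reference, the prompt-assembly step of that RAG flow might look like the sketch below. The retrieved chunks (hard-coded here) would come from the Pinecone query; the instruction wording and the "Source:" labeling are my own assumptions, not a fixed recipe.

```python
# Chunks as they might come back from the vector store, each carrying its
# source document name as metadata (example values only).
retrieved = [
    {"text": "Form 1095-C lists your employer and coverage months.",
     "source": "irs-guide.pdf"},
]

def build_prompt(question, chunks):
    """Assemble a context-restricted prompt that asks for the source."""
    context = "\n\n".join(
        f"Source: {c['source']}\n{c['text']}" for c in chunks
    )
    return (
        "Answer the question using ONLY the context below, and name the "
        "source document you used. If the answer is not in the context, "
        "say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

prompt = build_prompt("What does Form 1095-C include?", retrieved)
```

Because each chunk is labeled with its source in the context, the model can quote the document name back in its answer, which covers the "source of the answer" requirement from the original question.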


What is the solution for the 1st technique? I cannot stop it from answering from outside the content.