Using a specific knowledge base with GPT

I have a knowledge base which I want to use to answer queries. My current approach uses my own search method to find the top-k relevant passages and feeds them to the davinci model to perform QA with a custom prompt like “Answer the question based on the provided context”.
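Roughly, my flow looks like this (a minimal sketch, assuming the legacy openai Python library; search_passages is a stand-in for my custom search component):

```python
import openai

openai.api_key = "sk-..."  # your API key

def answer(question: str, passages: list[str]) -> str:
    # Stuff the retrieved passages into the prompt as context.
    context = "\n\n".join(passages)
    prompt = (
        "Answer the question based on the provided context.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    resp = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        max_tokens=256,
        temperature=0,
    )
    return resp["choices"][0]["text"].strip()

# passages = search_passages(question, k=3)  # my own search component
# print(answer(question, passages))
```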

I have gone through the other posts related to this topic, and the best option seemed to be the answers endpoint, but it is deprecated now. Another issue is that when the filtered documents are not relevant, the QA part also fails, which is expected. So what I wanted to ask is:

  1. Are there any open source projects which combine GPT with a knowledge base using a good search component? I have already tried GPTindex and gptanswers-node.
  2. Is there any way we can fine-tune the GPT model to make it memorize the fine-tuning data?

I can try embeddings, but I expect to run into similar issues with them, as my custom search is a similar embedding-based approach.

If anyone here has run into a similar issue or has been trying to combine GPT with a knowledge base and can share what they found, it would be really helpful.

TLDR - Is there any way to make GPT memorize data (knowledge base) using fine-tuning? Or any other ways to combine knowledge bases with GPT?


Hi, no, you can’t make GPT-3 memorize your knowledge base. Do you know why some of your filtered documents are not relevant? For this use case, the consensus in the forum is to use embeddings-based search.
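To make that concrete, here is a minimal sketch of embeddings-based search, assuming the legacy openai library and the text-embedding-ada-002 model; at scale a vector database would replace the in-memory numpy scan:

```python
import numpy as np
import openai

def embed(texts: list[str]) -> np.ndarray:
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
    return np.array([d["embedding"] for d in resp["data"]])

def top_k(query: str, passages: list[str], k: int = 3) -> list[str]:
    corpus = embed(passages)  # (n, d) passage vectors
    q = embed([query])[0]     # (d,) query vector
    # ada-002 vectors are unit length, so a dot product is cosine similarity.
    scores = corpus @ q
    best = np.argsort(scores)[::-1][:k]
    return [passages[i] for i in best]
```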


Hey, thanks for the answer.
The filtering part fails only for vague user queries, or when multiple passages are similar to the query and the expected passage sometimes does not show up in the top-k retrieved.

I thought maybe I could fine-tune the model to make it absorb the complete knowledge base, the way other LLMs retain knowledge from their pretraining data, but it seems that is not possible with GPT.


I still don’t get it then, and I have read different things about this. From image recognition, I remember fine-tuning a CNN to recognize custom objects.
If I can’t use fine-tuning to teach GPT new information, what’s the use case for fine-tuning then?

It depends on the use case, but here is an approach.

Let’s say that your knowledge base is structured as articles.
A user makes a certain request.

A human will combine information from two or more articles to formulate the correct response.

GPT-3 can do that, but not with your knowledge base. If you fine-tune a model, its responses may not be the ones you want.
The alternative would be to use embeddings, but in that case it is very likely that only a single article will be used to formulate the response, which is not as good as a human response.

There is a workaround, but it works only for limited use cases (a rough sketch follows the steps below).

  1. Store your knowledge base on Weaviate, as articles.
  2. When the user makes a request, prompt GPT-3 to generate a series of questions which can determine the correct response for the user request.
  3. Take the series of questions and submit all of them to Weaviate.
  4. Then prompt GPT-3 to formulate a response for the user by combining the information from the articles found by Weaviate.
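I have not tried this end to end, but a rough sketch of those four steps could look like the following. The Article class and its fields, the sub-question prompt, and the local Weaviate URL are all assumptions, and nearText only works if the Weaviate instance has a vectorizer module enabled:

```python
import openai
import weaviate

client = weaviate.Client("http://localhost:8080")  # assumed local Weaviate instance

def complete(prompt: str) -> str:
    resp = openai.Completion.create(
        model="text-davinci-003", prompt=prompt, max_tokens=256, temperature=0
    )
    return resp["choices"][0]["text"].strip()

def answer_request(request: str) -> str:
    # Step 2: prompt GPT-3 to break the request into search questions.
    raw = complete(
        "List three short search questions, one per line, that would help "
        f"answer this request:\n{request}"
    )
    questions = [q.strip() for q in raw.splitlines() if q.strip()]

    # Step 3: submit each question to Weaviate.
    articles = []
    for q in questions:
        res = (
            client.query.get("Article", ["title", "content"])
            .with_near_text({"concepts": [q]})
            .with_limit(2)
            .do()
        )
        articles += res["data"]["Get"]["Article"]

    # Step 4: combine the retrieved articles into a single response.
    context = "\n\n".join(a["content"] for a in articles)
    return complete(
        f"Using only this context:\n{context}\n\n"
        f"Respond to the user request: {request}\nResponse:"
    )
```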

Has anyone here tried this approach and has a public example they can share? What worked? What was hard?

Best Regards,

Anil


Hi,

I am doing something very similar, so I wanted to see where you got to with this. I am trying GPTIndex and LangChain.

This post was excellent for the latter

Build a GitHub support bot with GPT3, LangChain, and Python | Dagster Blog

If you look at the example, it brings together information from multiple Wikipedia pages to answer a question and also provides citation links (as instructed via the prompt).


Yea, LangChain is the way to go. They released some good stuff with their hackathon last week; this readthedocs app GitHub - hwchase17/chat-langchain is a really good introduction. The ingestion process uses Pathlib and Beautiful Soup to parse the HTML pulled down with wget, then embeds the pages into a Weaviate vector database. The Gradio front end embeds your query and uses it to return the nearest neighbors from Weaviate. These docs get added to the LangChain context and sent to Davinci for your answer.

This seems to be the workflow everyone has adopted over the last couple of weeks, and it is similar to what gpt-index does with its simple directory reader, which generates a vector index and makes recursive calls against the nodes.
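If you want to try that retrieval pattern without standing up Weaviate, here is a rough sketch against LangChain’s 0.0.x-era API, swapping in an in-memory FAISS index for the vector store (requires faiss-cpu; the corpus and settings are placeholders):

```python
from langchain.chains import RetrievalQA
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.vectorstores import FAISS

docs = ["...your knowledge base passages..."]  # placeholder corpus

# Ingestion: embed the corpus into a vector store.
store = FAISS.from_texts(docs, OpenAIEmbeddings())

# Query time: embed the question, fetch the nearest neighbors, stuff them
# into the prompt, and send the whole thing to davinci for the answer.
qa = RetrievalQA.from_chain_type(
    llm=OpenAI(model_name="text-davinci-003", temperature=0),
    chain_type="stuff",
    retriever=store.as_retriever(search_kwargs={"k": 4}),
)
print(qa.run("How do I configure the ingestion step?"))
```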

You can easily adopt either of these workflows to build your corpus and provide context to GPT or other models.