Using a specific knowledge base with GPT

I have a knowledge base which I want to use to answer queries. My current approach uses my own search method to find the top-k relevant passages and feeds them to the davinci model to perform QA with a custom prompt like “Answer the question based on the provided context”.
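Roughly, my flow looks like this (a minimal sketch, assuming the legacy openai Python library; search_passages is a stand-in for my custom search component):

```python
import openai

openai.api_key = "sk-..."  # your API key

def answer(question: str, passages: list[str]) -> str:
    # Stuff the retrieved passages into the prompt as context.
    context = "\n\n".join(passages)
    prompt = (
        "Answer the question based on the provided context.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )
    resp = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        max_tokens=256,
        temperature=0,
    )
    return resp["choices"][0]["text"].strip()

# passages = search_passages(question, k=3)  # my own search component
# print(answer(question, passages))
```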

I have gone through the other posts related to this topic, and the best option seemed to be the answers endpoint, but it is deprecated now. Another issue is that when the filtered documents are not relevant, the QA part also fails, which is expected. So what I wanted to ask is:

  1. Are there any open source projects which combine GPT with a knowledge base using a good search component? I have already tried GPTindex and gptanswers-node.
  2. Is there any way we can fine-tune the GPT model to make it memorize the fine-tuning data?

I can try embeddings, but I expect to run into similar issues with them, as my custom search is a similar embedding-based approach.

If anyone here has run into a similar issue or has been trying to combine GPT with a knowledge base and can share what they found, it would be really helpful.

TLDR - Is there any way to make GPT memorize data (knowledge base) using fine-tuning? Or any other ways to combine knowledge bases with GPT?


Hi, no, you can’t make GPT-3 memorize your knowledge base. Do you know why some of your filtered documents are not relevant? For this use case, the consensus in the forum is to use embeddings-based search.
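To make that concrete, here is a minimal sketch of embeddings-based search, assuming the legacy openai library and the text-embedding-ada-002 model; at scale a vector database would replace the in-memory numpy scan:

```python
import numpy as np
import openai

def embed(texts: list[str]) -> np.ndarray:
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=texts)
    return np.array([d["embedding"] for d in resp["data"]])

def top_k(query: str, passages: list[str], k: int = 3) -> list[str]:
    corpus = embed(passages)  # (n, d) passage vectors
    q = embed([query])[0]     # (d,) query vector
    # ada-002 vectors are unit length, so a dot product is cosine similarity.
    scores = corpus @ q
    best = np.argsort(scores)[::-1][:k]
    return [passages[i] for i in best]
```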


Hey, thanks for the answer.
The filtering part fails only for vague user queries, or when multiple passages are similar to the query and the expected passage sometimes does not show up in the top-k retrieved.

I thought maybe I could fine-tune the model to make it absorb the complete knowledge base, the way other LLMs retain knowledge from their pretraining data, but it seems that is not possible with GPT.


I still don’t get it then, and I have read different things about this. From image recognition, I remember fine-tuning a CNN to recognize custom objects.
If I can’t use fine-tuning to teach GPT new information, what’s the use case for fine-tuning then?

It depends on the use case, but here is an approach.

Let’s say that your knowledge base is structured as articles.
A user makes a certain request.

A human will combine information from two or more articles to formulate the correct response.

GPT-3 can do that, but not with your knowledge base. If you fine-tune a model, its responses may not be the ones you want.
The alternative would be to use embeddings, but in that case it is very likely that only a single article will be used to formulate the response, which is not as good as a human response.

There is a workaround, but it works only for limited use cases (a rough sketch follows the steps below).

  1. Store your knowledge base on Weaviate, as articles.
  2. When the user makes a request, prompt GPT-3 to generate a series of questions which can determine the correct response for the user request.
  3. Take the series of questions and submit all of them to Weaviate.
  4. Then prompt GPT-3 to formulate a response for the user by combining the information from the articles found by Weaviate.
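I have not tried this end to end, but a rough sketch of those four steps could look like the following. The Article class and its fields, the sub-question prompt, and the local Weaviate URL are all assumptions, and nearText only works if the Weaviate instance has a vectorizer module enabled:

```python
import openai
import weaviate

client = weaviate.Client("http://localhost:8080")  # assumed local Weaviate instance

def complete(prompt: str) -> str:
    resp = openai.Completion.create(
        model="text-davinci-003", prompt=prompt, max_tokens=256, temperature=0
    )
    return resp["choices"][0]["text"].strip()

def answer_request(request: str) -> str:
    # Step 2: prompt GPT-3 to break the request into search questions.
    raw = complete(
        "List three short search questions, one per line, that would help "
        f"answer this request:\n{request}"
    )
    questions = [q.strip() for q in raw.splitlines() if q.strip()]

    # Step 3: submit each question to Weaviate.
    articles = []
    for q in questions:
        res = (
            client.query.get("Article", ["title", "content"])
            .with_near_text({"concepts": [q]})
            .with_limit(2)
            .do()
        )
        articles += res["data"]["Get"]["Article"]

    # Step 4: combine the retrieved articles into a single response.
    context = "\n\n".join(a["content"] for a in articles)
    return complete(
        f"Using only this context:\n{context}\n\n"
        f"Respond to the user request: {request}\nResponse:"
    )
```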

Has anyone here tried this approach and has a public example they can share? What worked? What was hard?

Best Regards,

Anil


Hi,

I am doing something very similar, so I wanted to see where you got to with this. I am trying GPTIndex and LangChain.

This post was excellent for the latter

Build a GitHub support bot with GPT3, LangChain, and Python | Dagster Blog

If you look at the example, it brings together information from multiple Wikipedia pages to answer a question and also provides citation links (as instructed via the prompt).


Yea, LangChain is the way to go. They released some good stuff with their hackathon last week; this readthedocs app GitHub - hwchase17/chat-langchain is a really good introduction. The ingestion process uses Pathlib and Beautiful Soup to parse the HTML pulled down with wget, then embeds the pages into a Weaviate vector database. The Gradio front end embeds your query and uses it to return the nearest neighbors from Weaviate. These docs get added to the LangChain context and sent to Davinci for your answer.

This seems to be the workflow everyone has adopted over the last couple of weeks, and it is similar to what gpt-index does with its simple directory reader, which generates a vector index and makes recursive calls against the nodes.
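If you want to try that retrieval pattern without standing up Weaviate, here is a rough sketch against LangChain’s 0.0.x-era API, swapping in an in-memory FAISS index for the vector store (requires faiss-cpu; the corpus and settings are placeholders):

```python
from langchain.chains import RetrievalQA
from langchain.embeddings import OpenAIEmbeddings
from langchain.llms import OpenAI
from langchain.vectorstores import FAISS

docs = ["...your knowledge base passages..."]  # placeholder corpus

# Ingestion: embed the corpus into a vector store.
store = FAISS.from_texts(docs, OpenAIEmbeddings())

# Query time: embed the question, fetch the nearest neighbors, stuff them
# into the prompt, and send the whole thing to davinci for the answer.
qa = RetrievalQA.from_chain_type(
    llm=OpenAI(model_name="text-davinci-003", temperature=0),
    chain_type="stuff",
    retriever=store.as_retriever(search_kwargs={"k": 4}),
)
print(qa.run("How do I configure the ingestion step?"))
```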

You can easily adopt either of these workflows to build your corpus and provide context to GPT or other models.