I’m trying to build a chatbot with domain-specific knowledge (in particular, real estate in Switzerland). I created a chatbot that I feed with information from a PDF, and I run it with a chat memory function. It works pretty well, even in multiple languages.

I was curious whether the chatbot’s knowledge is limited to the custom knowledge, or whether it also has pre-trained knowledge from the underlying model. I first asked some domain-specific questions (in English), which were all answered correctly. Then I asked some general-knowledge questions, to which the chatbot answered “I don’t know”. So I concluded that there is no “outside” knowledge. Then I happened to ask the same general-knowledge question in German (“what’s the capital of Switzerland?”), and suddenly it knew the correct answer.
- Is this normal behaviour or is this some kind of bug?
- Is there a way to tell the chatbot either to use only the custom knowledge, or to also include the model’s pre-trained general knowledge?
I couldn’t find anything related to this in the LangChain documentation.
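To make the second question concrete, here is a minimal sketch of the two behaviours I have in mind, assuming they can be controlled through the prompt that wraps the retrieved chunks. The template wording and the `build_prompt` helper are hypothetical illustrations, not LangChain's defaults.

```python
# Hypothetical prompt templates: the wording is illustrative only,
# not the prompt LangChain actually uses.
STRICT_TEMPLATE = (
    "Answer the question using only the context below. "
    "If the answer is not in the context, reply exactly: I don't know.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)

OPEN_TEMPLATE = (
    "Answer the question. Prefer the context below, but fall back on "
    "your general knowledge if the context does not contain the answer.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)


def build_prompt(context: str, question: str, strict: bool = True) -> str:
    """Fill the chosen template with the retrieved context and the question."""
    template = STRICT_TEMPLATE if strict else OPEN_TEMPLATE
    return template.format(context=context, question=question)


if __name__ == "__main__":
    # Placeholder context; in the real chatbot this would be the retrieved chunks.
    print(build_prompt("(retrieved chunks go here)",
                       "What's the capital of Switzerland?",
                       strict=True))
```

My assumption is that such a template could be wrapped in a `PromptTemplate` with the input variables `context` and `question` and handed to `ConversationalRetrievalChain.from_llm` through `combine_docs_chain_kwargs={"prompt": ...}`, but that wiring is a guess on my side and may depend on the LangChain version.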
Here is the code I’m using:
import os
import pandas as pd
import matplotlib.pyplot as plt
from transformers import GPT2TokenizerFast
from langchain.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI
from langchain.chains import ConversationalRetrievalChain
from IPython.display import display
import ipywidgets as widgets
os.environ["OPENAI_API_KEY"] = "..."
# STEP 1: Split by chunk
# Convert PDF to text
import textract
doc = textract.process("./Allgemeine Bedingungen.pdf")
# Save to .txt and reopen
# textract returns bytes, so decode before writing
with open('Allgemeine Bedingungen.txt', 'w') as f:
    f.write(doc.decode('utf-8'))
with open('Allgemeine Bedingungen.txt', 'r') as f:
    text = f.read()
# Create function to count tokens
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
def count_tokens(text: str) -> int:
    return len(tokenizer.encode(text))
# Split text into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 512,
    chunk_overlap = 24,
    length_function = count_tokens,
)
chunks = text_splitter.create_documents([text])
# STEP 2: Embed text and store embeddings
# Get embedding model
embeddings = OpenAIEmbeddings()
# Create vector database
db = FAISS.from_documents(chunks, embeddings)
# STEP 3: Setup retrieval function
chain = load_qa_chain(OpenAI(temperature=0), chain_type="stuff")
query = "Was ist die Unterhaltspflicht des Mieters?"
docs = db.similarity_search(query)
chain.run(input_documents=docs, question=query)
# STEP 4: Create chatbot with chat memory
qa = ConversationalRetrievalChain.from_llm(OpenAI(temperature=0.1), db.as_retriever())
chat_history = []
def on_submit(_):
    query = input_box.value
    input_box.value = ""
    # Stop handling input once the user types 'stop'
    if query.lower() == 'stop':
        return
    result = qa({"question": query, "chat_history": chat_history})
    chat_history.append((query, result['answer']))
    display(widgets.HTML(f'<b>User:</b> {query}'))
    display(widgets.HTML(f'<b><font color="blue">Chatbot:</font></b> {result["answer"]}'))
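print("Welcome! Type 'stop' to quit.")
input_box = widgets.Text(placeholder='Enter your question:')
# Register the handler and show the input box; without this on_submit is never called
input_box.on_submit(on_submit)
display(input_box)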
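To rule out the chat memory as the cause of the English/German difference, the comparison can be repeated with an empty history for every question. The following is a minimal sketch under that assumption: `make_ask` and `probe_languages` are hypothetical helper names, and `fake_qa` is only a stand-in so the snippet runs without an API key. With the real chain, `qa` from Step 4 would be passed in instead.

```python
from typing import Callable, Dict


def make_ask(qa) -> Callable[[str], str]:
    """Wrap a chain so every question is asked with an empty chat history."""
    def ask(question: str) -> str:
        # Same call shape as in on_submit, but without any earlier turns
        result = qa({"question": question, "chat_history": []})
        return result["answer"]
    return ask


def probe_languages(ask: Callable[[str], str],
                    questions: Dict[str, str]) -> Dict[str, str]:
    """Ask the same question in several languages, each one independently."""
    return {lang: ask(question) for lang, question in questions.items()}


if __name__ == "__main__":
    # Stand-in for the real chain: it only echoes the question back.
    def fake_qa(inputs):
        return {"answer": "echo: " + inputs["question"]}

    answers = probe_languages(
        make_ask(fake_qa),
        {
            "en": "What's the capital of Switzerland?",
            "de": "Was ist die Hauptstadt der Schweiz?",
        },
    )
    for lang, answer in answers.items():
        print(lang, "->", answer)
```

If the answers still differ by language with an empty history, the memory is not the explanation and the difference has to come from retrieval or from how the model handles the prompt.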