LangChain: different knowledge depending on language

I’m trying to train a chatbot with domain-specific knowledge (specifically, real estate in Switzerland). I created a chatbot that I feed information from a PDF, and I run it with a memory function. It works pretty well, even in multiple languages. I was curious whether the chatbot’s knowledge is limited to the custom knowledge or whether it also has pre-trained knowledge from the model. I first asked some domain-specific questions (in English), which were all answered correctly. Then I asked some general-knowledge questions, to which the chatbot answered “I don’t know”. So I concluded there is no “outside” knowledge. Then I happened to ask the same question in German (“what’s the capital of Switzerland?”), and suddenly it knew the correct answer.

  • Is this normal behaviour or is this some kind of bug?
  • Is there a way to tell the chatbot either to focus only on the custom knowledge or to include pre-trained general knowledge?

I couldn’t find anything related to this in the LangChain documentation.

Here is the code I’m using:

import os
import textract
from transformers import GPT2TokenizerFast
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chains.question_answering import load_qa_chain
from langchain.llms import OpenAI
from langchain.chains import ConversationalRetrievalChain
from IPython.display import display
import ipywidgets as widgets

os.environ["OPENAI_API_KEY"] = "..."

# STEP 1: Split by chunk

# Convert PDF to text (textract returns bytes)
doc = textract.process("./Allgemeine Bedingungen.pdf")

# Save to .txt and reopen; explicit UTF-8 so umlauts survive on any platform
with open('Allgemeine Bedingungen.txt', 'w', encoding='utf-8') as f:
    f.write(doc.decode('utf-8'))

with open('Allgemeine Bedingungen.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# Create function to count tokens
tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")

def count_tokens(text: str) -> int:
    return len(tokenizer.encode(text))

# Split text into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size = 512,
    chunk_overlap  = 24,
    length_function = count_tokens,
)

chunks = text_splitter.create_documents([text])

# STEP 2: Embed text and store embeddings

# Get embedding model
embeddings = OpenAIEmbeddings()

# Create vector database
db = FAISS.from_documents(chunks, embeddings)

# STEP 3: Setup retrieval function

chain = load_qa_chain(OpenAI(temperature=0), chain_type="stuff")

query = "Was ist die Unterhaltspflicht des Mieters?"
docs = db.similarity_search(query)

chain.run(input_documents=docs, question=query)

# STEP 4: Create chatbot with chat memory

qa = ConversationalRetrievalChain.from_llm(OpenAI(temperature=0.1), db.as_retriever())

chat_history = []

def on_submit(_):
    query = input_box.value
    input_box.value = ""

    if query.lower() == 'stop':
        print("Cheers!")
        return

    result = qa({"question": query, "chat_history": chat_history})
    chat_history.append((query, result['answer']))

    display(widgets.HTML(f'<b>User:</b> {query}'))
    display(widgets.HTML(f'<b><font color="blue">Chatbot:</font></b> {result["answer"]}'))

print("Welcome! Type 'stop' to quit.")

input_box = widgets.Text(placeholder='Enter your question:')
input_box.on_submit(on_submit)

display(input_box)

Hi,

You can tell the model to answer only questions related to the context in the current prompt. I don’t use LangChain myself, but I assume it has some way to supply your own customised prompt with extra instructions.

Turn on verbose mode and you can see the prompt it uses. The model definitely has all of its general knowledge, but the LangChain prompt may tell it to ignore questions outside the context you provide.

chain = load_qa_chain(OpenAI(temperature=0), chain_type="stuff", verbose=True)
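
To make the "customised prompt" idea concrete, here is a minimal sketch in plain Python, with no LangChain involved. The two templates and the `build_prompt` helper are hypothetical names invented for illustration, and the wording is not LangChain's built-in prompt. The strict variant tells the model to refuse anything outside the context; the open variant allows a fallback to general knowledge.

```python
from typing import Sequence

# Hypothetical templates: the wording is illustrative, not LangChain's default prompt.
STRICT_TEMPLATE = (
    "Answer the question using only the context below. "
    "If the answer is not in the context, reply exactly: I don't know.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)

OPEN_TEMPLATE = (
    "Answer the question. Prefer the context below; if it does not contain "
    "the answer, fall back on your general knowledge and say that you did so.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n"
    "Answer:"
)


def build_prompt(question: str, chunks: Sequence[str], strict: bool = True) -> str:
    """Fill the chosen template with the retrieved chunks and the question."""
    template = STRICT_TEMPLATE if strict else OPEN_TEMPLATE
    context = "\n---\n".join(chunks)
    return template.format(context=context, question=question)


# Print the final prompt so you can see exactly what the model would receive.
print(build_prompt(
    "What is the capital of Switzerland?",
    ["The tenant is responsible for minor maintenance."],
    strict=True,
))
```

As far as I recall, the LangChain releases from around the time of this thread let you pass a template like this via the `prompt` argument of `load_qa_chain` (as a `PromptTemplate` with `context` and `question` variables), but please check the documentation for your installed version.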

The model might actually have different “knowledge” in different (human) languages. The reason is that the model predicts each output token from the previous tokens. When the previous tokens form a German sequence, it relies on predictions learned from German text; when they form an English sequence, it relies on predictions learned from English text.

If you ask the model to translate a question from German to English, answer it, and then translate the answer back into German, you may get a different answer than if you ask the question in German and request a direct answer.
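
A rough sketch of how you could compare the two routes side by side. Everything here is an assumption for illustration: `complete` is a hypothetical stand-in for whatever function sends a prompt to your model and returns its text reply, and the prompt wording is just one possibility.

```python
from typing import Callable, Dict

# Hypothetical type: any function that takes a prompt and returns the model's reply.
Complete = Callable[[str], str]


def answer_direct(question: str, complete: Complete) -> str:
    """Ask the question in its original language."""
    return complete(f"Question: {question}\nAnswer:")


def answer_via_english(question_de: str, complete: Complete) -> str:
    """Translate German -> English, answer in English, translate back to German."""
    question_en = complete(f"Translate to English:\n{question_de}")
    answer_en = complete(f"Question: {question_en}\nAnswer:")
    return complete(f"Translate to German:\n{answer_en}")


def compare(question_de: str, complete: Complete) -> Dict[str, str]:
    """Return both answers so they can be inspected side by side."""
    return {
        "direct": answer_direct(question_de, complete),
        "via_english": answer_via_english(question_de, complete),
    }
```

You would call `compare("Was ist die Hauptstadt der Schweiz?", my_llm)` with your own model function and read the two answers next to each other. Note that the translated route costs three model calls instead of one.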

That being said, if you want the model to incorporate customs in Switzerland, it might actually be better at doing that while working in German than when working in English. You’ll have to try different methods and see which one best fits your use case.

Thanks for the input, that makes sense. I managed to get the AI to answer in both languages. However, I noticed some odd behaviour besides the language thing. I’m asking these general-knowledge questions to figure out whether the AI has outside knowledge. It knows the capitals of some countries, but it doesn’t seem to know who Albert Einstein is, for instance. It seems odd to me that it knows some general-knowledge facts but not others. Is this normal behaviour?

That sounds like something in the middle gets in the way.

If ChatGPT knows it, the API should know it.

What model are you using? I am also looking for a model that works well with both English and German.