FAQ on custom data to support company internal

Imagine you have a company handbook and want to use openai to make a FAQ bot working on the handbook. So colleagues could ask any question and the bot will give you the answer according to the handbook. How would I do this? Prompting the whole handbook is not possible. Fine tuning completions needs too much data, I think. Any idea maybe? Really hope to get some Exchange here.

1 Like

It’s a great question, and while it seems straightforward, there could be several ways to create an ideal solution that is also financially practical. There’s an operating cost for every AI solution, so best to factor these in early in the requirements phase.

@PaulBellow referenced a really good approach using embeddings. I’ve built three KM systems using this exact approach, and embeddings have worked well. I’ve also built a few GPT Q&A systems for personal knowledge management that interoperate at the OS level, making it possible for the solution to work in every app context. More about that here.

To implement company data, begin creating a knowledge base, then use the system and assistant role as shown in this notebook :

Thank you. One question. As I submit fixed texts by roles. How can I make openai find automatically the right answers and maybe also answers according to all the submitted information, I did not yet add the fitting question for

This might help.

You have pointed out THE critical aspect of implementing conversational AI!

There are three aspects of implementation:

  1. Project goal. Knowing exactly what we expect our AI to do.
  2. Designing a good prompt dataset to make sure that when a person uses the AI, the application knows the answers.
  3. And now the most difficult part you have pointed out:
    a dataset for a project needs two columns (at least):
    the prompt, the response

To achieve expertise on this issue, OpenAI 's API now contains a system message and an assistant message on top of the user message.

For more, you can read and run the following notebook I shared on GitHub.

Love this obvious throwback to good requirements management. We see the cool glow of the AI light and forget all the basic stuff. Good reminder.

Did you forget to include the link?

I’m still very new at this, but I often ask myself - do we all need to become AI experts who craft every meal from scratch? I just want a good meal for my workers. I’m starting to see the buy vs rent debate surface in a lot of discussions. AI, after all, is not as easy as it looks.

Crazy idea, perhaps, maybe experiment with this. I used CustomGPT for an experiment, and it was surprisingly accurate for an FAQ of 200 questions and answers. I exported them into a PDF - no fancy formatting required. Then dropped the document into the project and started asking questions. My initial test of 75 questions that were not in the training data was about 88% accurate. Using the CustomGPT API to frame in some prompt guidance will likely nudge accuracy into the 90% range.

1 Like

The new retrieval plugin is the answer here :slight_smile:.


If a person just wants to explore OpenAI models, then there is nothing special to know about transformers.
However, if a person wants to implement transformers at advanced level, then it is necessary to become an expert.
This GitHub repository contains open-source OpenAI Python notebooks and reading resources to begin digging deeper into transformers :

I hope this helps.

1 Like

You can try Langchain in your scenario. I created embeddings and stored my data (plain text file) in Pinecone’s vector database. In my case, I used the story of Cinderella just for testing. I was able to get answers solely related to the story. I think that’s what you need in your case.

However, in my case, I want ChatGPT to use BOTH its own knowledge AND my dataset. if I ask questions that are beyond the scope of the story, such as “Who is the author of Cinderella?” I got answer like “I don’t know”. If I ask the same question in regular web interface, I will get a detailed answer.

Is there a way to expand ChatGPT’s knowledge base with my own dataset, so that ChatGPT will look into my own dataset first and then generate answers based on my data AND its trained data ?

Will the Retrieval Plugin solve this problem? Thanks!

AFAIK 100% factual information from your own corpus can be obtained via embeddings or via the Retrieval Plugin.

Embeddings will get a prompt and most probably show an answer built with your corpus. “Most probably” means that it may answer “I don’t know” or whatever you program it to answer in such case.

Retrieval Plugin will receive a prompt and reply with a list of 0 or more snippets from content found in documents indexed in the Retrieval Plugin - a bit like how Google works.

To merge this with ChatGPT you could use the “I don’t know” answer (or whatever you have defined for such case) to trigger a ChatCompletion using the same prompt or an automatically generated variation of it. HOWEVER the answer can be either factual or hallucinations. There’s been some testing documented in this community where, with high epochs during fine-tuning you can force more “knowledge” into the GPT model. There’s a discussion around this, though, where some say this is more like overfitting than learning. IMO it’s a bit of both.

Thank you for the mention.

Here is the link:

Now the question you must ask yourself is the following questions :

If an automated system is accurate 90% of the time how do you manage the other 10%?

How will the end-users know if a response is correct?

If they get an incorrect answer will they smile, complain, sue?

My notebook recommends using a 100% reliable knowledge base that is queried with keywords like a search engine. Then let a LLM formulate the correct answer nicely, possibly in different languages.

A lot has happened since March 20th. :wink: Using CustomGPT and even our own embedding architecture, we have the FAQ system producing outcomes with at least acceptable answers near-100% of the time for our automated test suite that includes 500+ ways to answer questions from a 77-item Q&A corpus. Intentional prompt injections or deliberate hallucination attempts are still on edge but generally thwarted.

We evaluate and measure every response to collect analytics about the performance. For the last 1,000 queries, non have failed. 84% were perfect, 13% were good, and the remainder were poor but deemed acceptable.

I think that’s a good approach in many cases, especially where queries can be deeply aligned with structured information. As it is designed like a search engine, it is also limited to search engine capabilities - full-text, inverted, fuzzy, wildcard, etc.

We were searching for a more accommodating user experience that would allow our customers to use expressions we could not predict. In lifestyle transportation accommodations (i.e., our disappearing truck camper), the use cases are vastly more horizontal than today’s use cases of RVs. As such, we have customers who come from a wide swath of interests and countries, and they use terms to describe their interests that are neither predictable nor reliably classified.

We may have solved this challenge without using a rigid knowledge base. We’re still testing, though.

You can start from this post to mange your company .pdf .txt

I’ve followed the same approach you mention in this post, but i’ve only a doubt: is it really necessary to store the embeddings on a DB? In my test i’ve simply stored the embeddings on files and seems to work quite fine, what could be the disadvantage?

No. That’s a design choice. I’ve experimented with Pinecone, Pandas cached in Streamlit apps, text files in Google Drive, spreadsheets, and Firebase. It’s just data with requirements to access in a manner that meets your objectives.

Performance. Vectors are dense arrays. Retrieval and comparison with a dot product are gating processes that may require data models with certain capabilities.

1 Like

Hi @smartleo , I would like to know how did you achieve your model answering questions that are beyond the scope of the story? I am stuck with a similar situation too. I have fine tuned a model based on a dataset that is highly specific. While my model answers any question in any format correctly, as long as its a part of the dataset, it does not seem to be able to answer questions NOT in the dataset, like “Who is the author of Cinderella?” - I would like it to answer “I dont know” or something similar

Thanks, i can try using sqlite that’s python native and accessible with Pandas :wink:

1 Like

I use embeddings to determine if the nature of the query is above an average similarity threshold. The threshold can be determined a number of ways, but the requirement is simple; establish guardrails and reject conversations that are not in the app’s wheelhouse.