FAQ on custom data to support company internal

manuelfusss · March 17, 2023, 7:38pm

Imagine you have a company handbook and want to use openai to make a FAQ bot working on the handbook. So colleagues could ask any question and the bot will give you the answer according to the handbook. How would I do this? Prompting the whole handbook is not possible. Fine tuning completions needs too much data, I think. Any idea maybe? Really hope to get some Exchange here.

PaulBellow · March 17, 2023, 8:26pm

bill.french · March 17, 2023, 9:35pm

It’s a great question, and while it seems straightforward, there could be several ways to create an ideal solution that is also financially practical. There’s an operating cost for every AI solution, so best to factor these in early in the requirements phase.

@PaulBellow referenced a really good approach using embeddings. I’ve built three KM systems using this exact approach, and embeddings have worked well. I’ve also built a few GPT Q&A systems for personal knowledge management that interoperate at the OS level, making it possible for the solution to work in every app context. More about that here.

denis.rothman76 · March 17, 2023, 11:10pm

To implement company data, begin creating a knowledge base, then use the system and assistant role as shown in this notebook :

github.com

Denis2054/Transformers-for-NLP-2nd-Edition/blob/main/Bonus/Prompt_Engineering_as_an_alternative_to_fine_tuning.ipynb

{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "provenance": [],
      "toc_visible": true
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "source": [

This file has been truncated. show original

manuelfusss · March 18, 2023, 8:30am

Thank you. One question. As I submit fixed texts by roles. How can I make openai find automatically the right answers and maybe also answers according to all the submitted information, I did not yet add the fitting question for

bill.french · March 18, 2023, 11:29am

This might help.

denis.rothman76 · March 18, 2023, 1:03pm

You have pointed out THE critical aspect of implementing conversational AI!

There are three aspects of implementation:

Project goal. Knowing exactly what we expect our AI to do.
Designing a good prompt dataset to make sure that when a person uses the AI, the application knows the answers.
And now the most difficult part you have pointed out:
a dataset for a project needs two columns (at least):
the prompt, the response

To achieve expertise on this issue, OpenAI 's API now contains a system message and an assistant message on top of the user message.

For more, you can read and run the following notebook I shared on GitHub.

AgusPG · March 25, 2023, 11:47am

The new retrieval plugin is the answer here .

denis.rothman76 · March 25, 2023, 8:44pm

If a person just wants to explore OpenAI models, then there is nothing special to know about transformers.
However, if a person wants to implement transformers at advanced level, then it is necessary to become an expert.
This GitHub repository contains open-source OpenAI Python notebooks and reading resources to begin digging deeper into transformers :
https://github.com/Denis2054/Transformers-for-NLP-2nd-Edition#readme

I hope this helps.

smartleo · April 5, 2023, 6:54pm

You can try Langchain in your scenario. I created embeddings and stored my data (plain text file) in Pinecone’s vector database. In my case, I used the story of Cinderella just for testing. I was able to get answers solely related to the story. I think that’s what you need in your case.

However, in my case, I want ChatGPT to use BOTH its own knowledge AND my dataset. if I ask questions that are beyond the scope of the story, such as “Who is the author of Cinderella?” I got answer like “I don’t know”. If I ask the same question in regular web interface, I will get a detailed answer.

Is there a way to expand ChatGPT’s knowledge base with my own dataset, so that ChatGPT will look into my own dataset first and then generate answers based on my data AND its trained data ?

Will the Retrieval Plugin solve this problem? Thanks!

juan_olano · April 6, 2023, 7:04am

AFAIK 100% factual information from your own corpus can be obtained via embeddings or via the Retrieval Plugin.

Embeddings will get a prompt and most probably show an answer built with your corpus. “Most probably” means that it may answer “I don’t know” or whatever you program it to answer in such case.

Retrieval Plugin will receive a prompt and reply with a list of 0 or more snippets from content found in documents indexed in the Retrieval Plugin - a bit like how Google works.

To merge this with ChatGPT you could use the “I don’t know” answer (or whatever you have defined for such case) to trigger a ChatCompletion using the same prompt or an automatically generated variation of it. HOWEVER the answer can be either factual or hallucinations. There’s been some testing documented in this community where, with high epochs during fine-tuning you can force more “knowledge” into the GPT model. There’s a discussion around this, though, where some say this is more like overfitting than learning. IMO it’s a bit of both.

denis.rothman76 · April 28, 2023, 9:29am

Thank you for the mention.

Here is the link:

github.com

Denis2054/Transformers-for-NLP-2nd-Edition/blob/main/Chapter17/Prompt_Engineering_as_an_alternative_to_fine_tuning.ipynb

{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "provenance": [],
      "toc_visible": true
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    },
    "language_info": {
      "name": "python"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "source": [

This file has been truncated. show original

Now the question you must ask yourself is the following questions :

If an automated system is accurate 90% of the time how do you manage the other 10%?

How will the end-users know if a response is correct?

If they get an incorrect answer will they smile, complain, sue?

My notebook recommends using a 100% reliable knowledge base that is queried with keywords like a search engine. Then let a LLM formulate the correct answer nicely, possibly in different languages.

bill.french · April 28, 2023, 12:28pm

A lot has happened since March 20th. Using CustomGPT and even our own embedding architecture, we have the FAQ system producing outcomes with at least acceptable answers near-100% of the time for our automated test suite that includes 500+ ways to answer questions from a 77-item Q&A corpus. Intentional prompt injections or deliberate hallucination attempts are still on edge but generally thwarted.

We evaluate and measure every response to collect analytics about the performance. For the last 1,000 queries, none have failed. 84% were perfect, 13% were good, and the remainder were poor but deemed acceptable.

I think that’s a good approach in many cases, especially where queries can be deeply aligned with structured information. As it is designed like a search engine, it is also limited to search engine capabilities - full-text, inverted, fuzzy, wildcard, etc.

We were searching for a more accommodating user experience that would allow our customers to use expressions we could not predict. In lifestyle transportation accommodations (i.e., our disappearing truck camper), the use cases are vastly more horizontal than today’s use cases of RVs. As such, we have customers who come from a wide swath of interests and countries, and they use terms to describe their interests that are neither predictable nor reliably classified.

We may have solved this challenge without using a rigid knowledge base. We’re still testing, though.

cap · April 28, 2023, 3:00pm

You can start from this post to mange your company .pdf .txt

gianluca.suzzi · May 12, 2023, 9:15am

I’ve followed the same approach you mention in this post, but i’ve only a doubt: is it really necessary to store the embeddings on a DB? In my test i’ve simply stored the embeddings on files and seems to work quite fine, what could be the disadvantage?

bill.french · May 12, 2023, 10:10am

No. That’s a design choice. I’ve experimented with Pinecone, Pandas cached in Streamlit apps, text files in Google Drive, spreadsheets, and Firebase. It’s just data with requirements to access in a manner that meets your objectives.

Performance. Vectors are dense arrays. Retrieval and comparison with a dot product are gating processes that may require data models with certain capabilities.

mikehunt · May 12, 2023, 10:17am

Hi @smartleo , I would like to know how did you achieve your model answering questions that are beyond the scope of the story? I am stuck with a similar situation too. I have fine tuned a model based on a dataset that is highly specific. While my model answers any question in any format correctly, as long as its a part of the dataset, it does not seem to be able to answer questions NOT in the dataset, like “Who is the author of Cinderella?” - I would like it to answer “I dont know” or something similar

gianluca.suzzi · May 12, 2023, 10:18am

Thanks, i can try using sqlite that’s python native and accessible with Pandas

bill.french · May 12, 2023, 11:08am

I use embeddings to determine if the nature of the query is above an average similarity threshold. The threshold can be determined a number of ways, but the requirement is simple; establish guardrails and reject conversations that are not in the app’s wheelhouse.

mikehunt · May 12, 2023, 9:02pm

@bill.french Thank you. So do you first use the Fine Tune API and then use the Embeddings API? Or is your whole dataset trained using the embeddings API?

Topic		Replies	Views
How can I use Embeddings with Chat GPT 3-5 Turbo Prompting	39	48782	December 12, 2023
QA fine-tuned chatbot not answering from the trained data but nonfactual API	73	19111	November 24, 2023
How can we make the answer concise with fine tuning? API fine-tuning , api	8	2960	June 7, 2023
Fine-tuning myths / OpenAI documentation API	24	14797	December 23, 2023
Fine tuning a model for customer service for our specific app Prompting	23	14602	May 14, 2024

FAQ on custom data to support company internal

Related topics