FAQ on custom data to support company internal

If a person just wants to explore OpenAI models, then there is nothing special to know about transformers.
However, if a person wants to implement transformers at advanced level, then it is necessary to become an expert.
This GitHub repository contains open-source OpenAI Python notebooks and reading resources to begin digging deeper into transformers:
https://github.com/Denis2054/Transformers-for-NLP-2nd-Edition#readme

I hope this helps.

You can try Langchain in your scenario. I created embeddings and stored my data (plain text file) in Pinecone’s vector database. In my case, I used the story of Cinderella just for testing. I was able to get answers solely related to the story. I think that’s what you need in your case.

However, in my case, I want ChatGPT to use BOTH its own knowledge AND my dataset. If I ask questions that are beyond the scope of the story, such as “Who is the author of Cinderella?”, I get an answer like “I don’t know”. If I ask the same question in the regular web interface, I get a detailed answer.

Is there a way to expand ChatGPT’s knowledge base with my own dataset, so that ChatGPT will look into my own dataset first and then generate answers based on my data AND its training data?

Will the Retrieval Plugin solve this problem? Thanks!

AFAIK 100% factual information from your own corpus can be obtained via embeddings or via the Retrieval Plugin.

Embeddings will get a prompt and most probably show an answer built with your corpus. “Most probably” means that it may answer “I don’t know” or whatever you program it to answer in such case.

Retrieval Plugin will receive a prompt and reply with a list of 0 or more snippets from content found in documents indexed in the Retrieval Plugin - a bit like how Google works.

To merge this with ChatGPT you could use the “I don’t know” answer (or whatever you have defined for such case) to trigger a ChatCompletion using the same prompt or an automatically generated variation of it. HOWEVER the answer can be either factual or hallucinations. There’s been some testing documented in this community where, with high epochs during fine-tuning you can force more “knowledge” into the GPT model. There’s a discussion around this, though, where some say this is more like overfitting than learning. IMO it’s a bit of both.
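
A minimal sketch of that fallback routing, assuming you already have two callables of your own: one that answers from your corpus and returns an “I don’t know” sentinel when it has nothing, and one that calls a general chat model. The names `answer_with_fallback`, `corpus_answer`, `general_answer` and `FALLBACK_MARKER` are hypothetical, not part of any OpenAI library.

```python
from typing import Callable

# Assumption: the corpus-only step returns this sentinel when it has no answer.
FALLBACK_MARKER = "I don't know"


def answer_with_fallback(
    query: str,
    corpus_answer: Callable[[str], str],
    general_answer: Callable[[str], str],
) -> dict:
    """Try the corpus-grounded answerer first, then fall back to the general model."""
    grounded = corpus_answer(query)
    if FALLBACK_MARKER.lower() not in grounded.lower():
        return {"source": "corpus", "answer": grounded}
    # Nothing in the corpus matched, so the reply below is NOT grounded in your data
    # and may be a hallucination. Keep the label so the UI can say so.
    return {"source": "model", "answer": general_answer(query)}
```

Returning the `source` alongside the answer is a design choice: it lets you show end users which replies came from your own documents and which came from the model’s general knowledge.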

Thank you for the mention.

Here is the link:

Now you must ask yourself the following questions:

If an automated system is accurate 90% of the time how do you manage the other 10%?

How will the end-users know if a response is correct?

If they get an incorrect answer will they smile, complain, sue?

My notebook recommends using a 100% reliable knowledge base that is queried with keywords like a search engine. Then let an LLM formulate the correct answer nicely, possibly in different languages.

A lot has happened since March 20th. :wink: Using CustomGPT and even our own embedding architecture, we have the FAQ system producing outcomes with at least acceptable answers near-100% of the time for our automated test suite that includes 500+ ways to answer questions from a 77-item Q&A corpus. Intentional prompt injections or deliberate hallucination attempts are still on edge but generally thwarted.

We evaluate and measure every response to collect analytics about the performance. For the last 1,000 queries, none have failed. 84% were perfect, 13% were good, and the remainder were poor but deemed acceptable.

I think that’s a good approach in many cases, especially where queries can be deeply aligned with structured information. As it is designed like a search engine, it is also limited to search engine capabilities - full-text, inverted, fuzzy, wildcard, etc.

We were searching for a more accommodating user experience that would allow our customers to use expressions we could not predict. In lifestyle transportation accommodations (i.e., our disappearing truck camper), the use cases are vastly more horizontal than today’s use cases of RVs. As such, we have customers who come from a wide swath of interests and countries, and they use terms to describe their interests that are neither predictable nor reliably classified.

We may have solved this challenge without using a rigid knowledge base. We’re still testing, though.

You can start from this post to manage your company’s .pdf and .txt files.

I’ve followed the same approach you mention in this post, but I have one question: is it really necessary to store the embeddings in a DB? In my test I simply stored the embeddings in files, and it seems to work fine. What could be the disadvantage?

No. That’s a design choice. I’ve experimented with Pinecone, Pandas cached in Streamlit apps, text files in Google Drive, spreadsheets, and Firebase. It’s just data with requirements to access in a manner that meets your objectives.

Performance. Vectors are dense arrays. Retrieval and comparison with a dot product are gating processes that may require data models with certain capabilities.

Hi @smartleo, I would like to know how you got your model to answer questions that are beyond the scope of the story. I am stuck in a similar situation. I have fine-tuned a model on a highly specific dataset. While my model correctly answers any question in any format, as long as it’s part of the dataset, it does not seem to be able to handle questions NOT in the dataset, like “Who is the author of Cinderella?” I would like it to answer “I don’t know” or something similar.

Thanks, I can try using SQLite, which is native to Python and accessible with Pandas :wink:
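
For what it’s worth, here is a rough sketch of the SQLite idea using only the standard library. The table layout and the helper names (`open_store`, `save_item`, `load_items`) are my own assumptions for illustration; each vector is stored as a JSON string, which is simple but not the most compact option.

```python
import json
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a small embedding store. ':memory:' keeps it in RAM."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS items ("
        "id INTEGER PRIMARY KEY, text TEXT NOT NULL, embedding TEXT NOT NULL)"
    )
    return conn


def save_item(conn: sqlite3.Connection, text: str, embedding: list) -> None:
    """Store one text chunk with its embedding serialized as JSON."""
    conn.execute(
        "INSERT INTO items (text, embedding) VALUES (?, ?)",
        (text, json.dumps(embedding)),
    )
    conn.commit()


def load_items(conn: sqlite3.Connection) -> list:
    """Return every (text, embedding) pair in insertion order."""
    rows = conn.execute("SELECT text, embedding FROM items ORDER BY id").fetchall()
    return [(text, json.loads(blob)) for text, blob in rows]
```

Note that this still loads every vector into memory to compare against the query, so it behaves much like the flat-file approach; the database mainly buys you easier updates and a single file to manage.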

I use embeddings to determine if the nature of the query is above an average similarity threshold. The threshold can be determined a number of ways, but the requirement is simple; establish guardrails and reject conversations that are not in the app’s wheelhouse.

@bill.french Thank you. So do you first use the Fine Tune API and then use the Embeddings API? Or is your whole dataset trained using the embeddings API?

Hi @mikehunt, I am still not able to achieve what I intend to do, i.e. the program should search for answers in my custom dataset first before using its own knowledge.

BTW, I am curious how big your dataset is. I am working on another application that answers questions solely based on custom data, but I have difficulty going beyond 5MB for various reasons.

Process is simple… (in my case) I do not use fine-tuning for any of this.

  1. Create embeddings for each item in the data set.
  2. Create the embedding for each new query.
  3. Use the Dot Product to determine the top five similar data items compared to the query vector.
  4. Sort the top five descending; reject the items from the lowest similarity matches if they are outside the threshold of relevance.
  5. Use the remaining top similarity items to create a learner prompt.
  6. Use a completion based on the prompt that includes the learner prompt and describes how to answer the query.
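
The six steps above can be sketched roughly as follows. This is not the poster’s actual code: `build_learner_prompt` and the prompt wording are hypothetical, `embed` stands in for whatever embedding call you use, and the 0.75 default threshold is only an illustrative value. Step 1 is assumed to have happened already, so `items` holds precomputed `(text, vector)` pairs.

```python
from typing import Callable, Optional, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def build_learner_prompt(
    query: str,
    items: Sequence[tuple],
    embed: Callable[[str], Sequence[float]],
    top_k: int = 5,
    threshold: float = 0.75,
) -> Optional[str]:
    """Return a prompt built from the most similar items, or None if none are relevant."""
    q_vec = embed(query)                                        # step 2
    scored = [(dot(q_vec, vec), text) for text, vec in items]   # step 3
    scored.sort(key=lambda pair: pair[0], reverse=True)         # step 4: descending
    kept = [text for score, text in scored[:top_k] if score >= threshold]
    if not kept:
        return None  # nothing relevant enough; reject the query
    context = "\n\n".join(kept)                                 # step 5
    return (                                                    # step 6: completion prompt
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say you do not know.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer:"
    )
```

The returned string is what you would send to the completion endpoint; a `None` result is the guardrail case where the app declines to answer.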

This is what my system looks like -

What you’re describing is likely best addressed with embeddings and completions. It doesn’t really matter how big the data set is, but performance is required if you don’t want your users waiting a long time for answers.

I can probably help if I knew more about your data set and the underlying database architecture you want to use to build your AI app.

I believe @smartleo and I have the same end goal. Do you have sample code or a GitHub link to your project?
My data looks like this:

{"prompt":"Is there a way to delete unused materials-->SLACKINPUT<--","completion":" Pricebook Utilization -->GPT3OUTPUT<--"}
{"prompt":"What is the best way to unlock batches-->SLACKINPUT<--","completion":" Correct Batch Status -->GPT3OUTPUT<--"}

The file contains about 1000 such prompt-completion pairs. Fine Tuning works perfectly on any question in any form on this dataset, but not for random questions. How would I use embeddings for this considering each row is a query in itself?
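
One possible way to reuse this file for embeddings, sketched under the assumption that every row ends with the same `-->SLACKINPUT<--` and `-->GPT3OUTPUT<--` markers shown above: strip the markers, embed only the question side, and keep the completion as the payload to return. The helper names are hypothetical.

```python
import json

PROMPT_MARKER = "-->SLACKINPUT<--"
COMPLETION_MARKER = "-->GPT3OUTPUT<--"


def _strip_marker(text: str, marker: str) -> str:
    """Remove a trailing fine-tuning marker and surrounding whitespace."""
    if text.endswith(marker):
        text = text[: -len(marker)]
    return text.strip()


def parse_pairs(jsonl_text: str) -> list:
    """Turn fine-tuning JSONL rows into clean (question, answer) tuples."""
    pairs = []
    for line in jsonl_text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines between rows
        row = json.loads(line)
        question = _strip_marker(row["prompt"], PROMPT_MARKER)
        answer = _strip_marker(row["completion"], COMPLETION_MARKER)
        pairs.append((question, answer))
    return pairs
```

Each `question` would then get its own embedding vector, stored next to its `answer`.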

No. My company’s IP restrictions are significant. But I’ll try to get clearance for something.

Let me understand this better:

A worker enters a query like this:

Is there a way to delete unused materials?

You want the AI app to respond like this:

Pricebook Utilization

Furthermore, you cannot predict the many different ways that a worker will ask the query, right? They might say any of these things:

  • How do I delete unused materials?
  • Help me remove unused materials.
  • I have unused materials; what do I do with them?

You’ll have to explain your fine-tuning methodology because I’m unfamiliar with the way you are framing your narrative. Fine-tuning is a very tedious process that ostensibly builds an LLM variant. Are you actually building a model using this approach?

Yes I am using OpenAI Fine Tuning API/CLI commands, where I train a model on my dataset. And yes, that model is able to answer

  • How do I delete unused materials?
  • Help me remove unused materials.
  • I have unused materials; what do I do with them?
    and other similar questions accurately, as long as it’s a part of the dataset.

What I would like is for the bot to answer “I don’t know” if something random or something not in the dataset gets asked, for example “How’s the weather today”. But if I submit this question in the OpenAI playground running my model, the answers are terrible and random. How do I achieve this with fine-tuning and/or embeddings?

Okay, so, I don’t think you want a fine-tuned model for this narrow set of requirements.

Instead, you need to create embedding vectors for each of your 1,000 items. And then, when a worker enters a query, you get the embedding vector for the query. Then you simply compare the vector with each of the 1,000 items. The one with the highest score is likely the answer, and you return to the user the related response desired.

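The vector comparisons are performed using a dot product or cosine similarity. This returns a score indicating how similar the query is to each of your 1,000 items; cosine similarity can range from -1 to 1, though with OpenAI embeddings the scores in practice fall between 0 and 1. I’ve written about that part of the solution here.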

The 1,000-item database should contain the data you showed earlier, including the responses, so that when you single out the most similar query to that which was asked, you have the answer - you needn’t do anything else.

This approach is fast and cheap, but most importantly, it doesn’t introduce the massive distractions into the conversation that are made possible with fine-tuning. It is a closed environment where you control the outcome. For example, you could set a rule that causes the app to reject any query that doesn’t meet a score of 0.75 or higher on the basis that it’s probably not a relevant question.
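
Here is a small sketch of that lookup with the 0.75 cut-off, using cosine similarity in plain Python. The function names and the `(question_vector, answer)` table layout are assumptions for illustration; in a real app the vectors would come from your embedding call.

```python
import math


def cosine(a, b) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0


def lookup(query_vec, table, min_score: float = 0.75):
    """table: list of (question_vector, answer) pairs.

    Returns (answer, score) for the best match, or (None, score) when the
    best match falls below min_score and the query should be rejected.
    """
    if not table:
        return None, 0.0
    best_score, best_answer = max(
        ((cosine(query_vec, vec), answer) for vec, answer in table),
        key=lambda pair: pair[0],
    )
    if best_score < min_score:
        return None, best_score
    return best_answer, best_score
```

A `None` answer is where the app would reply “I don’t know”. The right cut-off depends on your data and embedding model, so treat 0.75 as a starting point to tune against real queries.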

Furthermore, this approach doesn’t incur model training costs; only embedding costs, which are 1/600th the rate of an inference call. Lastly, there is no latency waiting for the LLM to process your request and locate your data. You have all the data, and it is a single search enumeration away.

There are cases where a fine-tuned model is required. I don’t think this is one of them.

Another advantage of an embedding architecture is that you can shape the outcome with greater precision. For example, imagine instead of picking the highest similarity hit, you wanted to use the top three and merge them together into a learner prompt, and then use a completion prompt to describe the best approach. Or, offer workers three approaches that may be applicable. This is especially possible if multiple similarity hits come back clustered around a very tight score grouping. A tight clustering of similar items is an indicator of uncertainty, and letting the worker choose the final approach may be desired.
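
A rough sketch of the “tight clustering” check, assuming you already have `(score, answer)` pairs from the similarity step. The function name `pick_candidates` and the `spread` of 0.02 are made-up values for illustration, not a recommendation.

```python
def pick_candidates(scored, top_k: int = 3, spread: float = 0.02) -> list:
    """scored: list of (score, answer) pairs.

    Returns a single answer when the best hit clearly stands out, or up to
    top_k answers when the leading scores sit within `spread` of the best.
    """
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]
    if not ranked:
        return []
    best = ranked[0][0]
    # Keep every top-k hit whose score is close enough to the winner.
    return [answer for score, answer in ranked if best - score <= spread]
```

When more than one answer comes back, the app could either list them for the worker to choose from or merge them into a learner prompt for a completion.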

UPDATE: BTW, in your current approach, does your prompt instruct your fine-tuned model to say it doesn’t know when the answer cannot be deduced from the stated context? Example …

// Assumes `paragraph` is an array of context strings and `question` is the user's query.
const prompt =
  "Answer the following question from the context, if the answer can not be deduced from the context, say 'I dont know' :\n\n" +
  "Context :\n" +
  paragraph.join("\n\n") +
  "\n\nQuestion :\n" +
  question +
  "?" +
  "\n\nAnswer :";