FAQ on custom data to support company internal

@bill.french Thank you. So do you first use the Fine Tune API and then use the Embeddings API? Or is your whole dataset trained using the embeddings API?

Hi @mikehunt, I am still not able to achieve what I intend to do, i.e. the program should search for answers in my custom dataset first before using its own knowledge.

BTW, I am curious how big your dataset is. I am working on another application that answers questions solely based on custom data, but I have had difficulty going beyond 5MB for various reasons.

The process is simple. In my case, I do not use fine-tuning for any of this.

  1. Create embeddings for each item in the data set.
  2. Create the embedding for each new query.
  3. Use the Dot Product to determine the top five similar data items compared to the query vector.
  4. Sort the top five descending; reject the items from the lowest similarity matches if they are outside the threshold of relevance.
  5. Use the remaining top similarity items to create a learner prompt.
  6. Use a completion based on the prompt that includes the learner prompt and describes how to answer the query.
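As a rough sketch of steps 3 to 5, the snippet below scores items with a dot product, keeps the top five, drops anything under a relevance threshold, and joins what is left into a learner prompt. The toy two-dimensional vectors, the `text` and `vector` field names, the 0.75 threshold, and the helper names are all assumptions for illustration; real vectors would come from whichever embeddings call you use.

```javascript
// Dot product of two equal-length vectors.
function dot(a, b) {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

// Steps 3-4: score every item, sort descending, keep the k best,
// then drop any that fall below the relevance threshold (assumed value).
function topMatches(queryVector, items, k = 5, threshold = 0.75) {
  return items
    .map((item) => ({ ...item, score: dot(queryVector, item.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .filter((item) => item.score >= threshold);
}

// Step 5: join the surviving items into the context part of the prompt.
function buildLearnerPrompt(matches) {
  return matches.map((m) => m.text).join("\n\n");
}

// Toy vectors stand in for real embeddings; the third entry is hypothetical.
const items = [
  { text: "Unlock batches: Correct Batch Status", vector: [0.0, 1.0] },
  { text: "Delete unused materials: Pricebook Utilization", vector: [1.0, 0.0] },
  { text: "Archive old materials: hypothetical entry", vector: [0.8, 0.6] },
];

const matches = topMatches([1.0, 0.0], items);
console.log(buildLearnerPrompt(matches));
```

The resulting string is what gets placed in the completion prompt for step 6, alongside the instructions on how to answer.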

This is what my system looks like -

What you’re describing is likely best addressed with embeddings and completions. It doesn’t really matter how big the data set is, but the similarity search has to be fast if you don’t want your users waiting a long time for answers.

I could probably help if I knew more about your data set and the underlying database architecture you want to use to build your AI app.

I believe @smartleo and I have the same end goals. Do you have sample code or a GitHub link to your project?
My data looks like this:

{"prompt":"Is there a way to delete unused materials-->SLACKINPUT<--","completion":" Pricebook Utilization -->GPT3OUTPUT<--"}
{"prompt":"What is the best way to unlock batches-->SLACKINPUT<--","completion":" Correct Batch Status -->GPT3OUTPUT<--"}

The file contains about 1000 such prompt-completion pairs. Fine Tuning works perfectly on any question in any form on this dataset, but not for random questions. How would I use embeddings for this considering each row is a query in itself?

No. My company’s IP restrictions are significant. But I’ll try to get clearance for something.

Let me understand this better:

A worker enters a query like this:

Is there a way to delete unused materials?

You want the AI app to respond like this:

Pricebook Utilization

Furthermore, you cannot predict the many different ways that a worker will ask the query, right? They might say any of these things:

  • How do I delete unused materials?
  • Help me remove unused materials.
  • I have unused materials; what do I do with them?

You’ll have to explain your fine-tuning methodology because I’m unfamiliar with the way you are framing your narrative. Fine-tuning is a very tedious process that ostensibly builds an LLM variant. Are you actually building a model using this approach?

Yes I am using OpenAI Fine Tuning API/CLI commands, where I train a model on my dataset. And yes, that model is able to answer

  • How do I delete unused materials?
  • Help me remove unused materials.
  • I have unused materials; what do I do with them?
and other similar questions accurately, as long as they are part of the dataset.

What I would like is for the bot to answer “I don’t know” if something random or something not in the dataset gets asked, for example “How’s the weather today”. But if I submit this question in the OpenAI playground running my model, the answers are terrible and random. How do I achieve this with fine-tuning, embeddings, or both?

Okay, so, I don’t think you want a fine-tuned model for this narrow set of requirements.

Instead, you need to create embedding vectors for each of your 1,000 items. And then, when a worker enters a query, you get the embedding vector for the query. Then you simply compare the query vector with each of the 1,000 item vectors. The item with the highest score is likely the answer, and you return its stored response to the user.

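The vector comparisons are performed using a dot product or cosine similarity. For unit-length vectors, which is how OpenAI's embeddings are returned, the two give the same score. In principle the score can range from -1 to 1, although for text embeddings it is usually positive; the higher the value, the more similar the query is to that item. I’ve written about that part of the solution here.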
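To make the comparison concrete, here is a small sketch of both measures in plain JavaScript. The function names and the toy vectors are illustrative assumptions, not part of any library. It shows that once vectors are scaled to unit length, the dot product and the cosine similarity agree, and that opposite vectors score -1.

```javascript
// Dot product of two equal-length vectors.
function dot(a, b) {
  return a.reduce((sum, x, i) => sum + x * b[i], 0);
}

// Euclidean length of a vector.
function norm(a) {
  return Math.sqrt(dot(a, a));
}

// Cosine similarity: the dot product divided by the product of the lengths.
function cosineSimilarity(a, b) {
  return dot(a, b) / (norm(a) * norm(b));
}

// Scale a vector to unit length.
function normalize(a) {
  const n = norm(a);
  return a.map((x) => x / n);
}

const query = [3, 4];
const item = [4, 3];

// Both lines print approximately 0.96.
console.log(cosineSimilarity(query, item));
console.log(dot(normalize(query), normalize(item)));

// Vectors pointing in opposite directions score -1.
console.log(cosineSimilarity([1, 0], [-1, 0]));
```

If the stored vectors are already unit length, the plain dot product saves the two square roots per comparison.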

The 1,000-item database should contain the data you showed earlier, including the responses, so that when you single out the most similar query to that which was asked, you have the answer - you needn’t do anything else.
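One way to get from the prompt-completion file shown earlier to such a database is sketched below. It assumes the file is JSON Lines with the `-->SLACKINPUT<--` and `-->GPT3OUTPUT<--` markers from the sample rows; the `parseRows` helper and the `question`, `answer`, and `vector` field names are hypothetical.

```javascript
// Markers copied from the sample rows in the thread.
const INPUT_MARKER = "-->SLACKINPUT<--";
const OUTPUT_MARKER = "-->GPT3OUTPUT<--";

// Turn JSON Lines text into records that keep each question next to its answer.
function parseRows(jsonl) {
  return jsonl
    .split("\n")
    .filter((line) => line.trim() !== "")
    .map((line) => {
      const row = JSON.parse(line);
      return {
        question: row.prompt.replace(INPUT_MARKER, "").trim(),
        answer: row.completion.replace(OUTPUT_MARKER, "").trim(),
        vector: null, // to be filled in by whichever embeddings call you use
      };
    });
}

const sample = [
  '{"prompt":"Is there a way to delete unused materials-->SLACKINPUT<--","completion":" Pricebook Utilization -->GPT3OUTPUT<--"}',
  '{"prompt":"What is the best way to unlock batches-->SLACKINPUT<--","completion":" Correct Batch Status -->GPT3OUTPUT<--"}',
].join("\n");

const records = parseRows(sample);
console.log(records);
```

Only the `question` text needs to be embedded; the `answer` travels with the record so that the best match can be returned directly.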

This approach is fast and cheap, but most importantly, it doesn’t introduce the massive distractions into the conversation that are made possible with fine-tuning. It is a closed environment where you control the outcome. For example, you could set a rule that causes the app to reject any query that doesn’t meet a score of 0.75 or higher on the basis that it’s probably not a relevant question.
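A minimal sketch of that rejection rule follows, using toy three-dimensional vectors in place of real embeddings. The `answerQuery` helper, the record shape, and the fallback text are assumptions for illustration; 0.75 is the example cutoff mentioned above and would need tuning against real queries.

```javascript
// Dot product of two equal-length vectors.
function dot(a, b) {
  return a.reduce((sum, x, i) => sum + x * b[i], 0);
}

// Return the stored answer of the best match, or a fallback when
// even the best score is below the relevance threshold.
function answerQuery(queryVector, records, threshold = 0.75) {
  let best = null;
  for (const record of records) {
    const score = dot(queryVector, record.vector);
    if (best === null || score > best.score) best = { record, score };
  }
  if (best === null || best.score < threshold) return "I don't know";
  return best.record.answer;
}

// Toy unit vectors; real ones would come from an embeddings call.
const records = [
  { answer: "Pricebook Utilization", vector: [1, 0, 0] },
  { answer: "Correct Batch Status", vector: [0, 1, 0] },
];

// A query close to the first record returns its answer.
console.log(answerQuery([0.96, 0.28, 0], records));

// An unrelated query (think "How's the weather today") scores low everywhere.
console.log(answerQuery([0, 0, 1], records));
```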

Furthermore, this approach doesn’t incur model training costs; only embedding costs, which are 1/600th the rate of an inference call. Lastly, there is no latency waiting for the LLM to process your request and locate your data. You have all the data, and it is a single search enumeration away.

There are cases where a fine-tuned model is required. I don’t think this is one of them.

Another advantage of an embedding architecture is that you can shape the outcome with greater precision. For example, imagine that instead of picking the highest-similarity hit, you wanted to use the top three and merge them together into a learner prompt, and then use a completion prompt to describe the best approach. Or, offer workers three approaches that may be applicable. This is especially useful if multiple similarity hits come back clustered around a very tight score grouping. A tight clustering of similar items is an indicator of uncertainty, and letting the worker choose the final approach may be desired.
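The clustering check could look something like the sketch below. It assumes each hit already carries a similarity `score`; the `chooseCandidates` name and the 0.03 margin are hypothetical and would need tuning on real data.

```javascript
// Take the k best hits. If their scores sit within `margin` of each other,
// treat the result as uncertain and return all of them so the worker can
// choose; otherwise return only the single best hit.
function chooseCandidates(scored, k = 3, margin = 0.03) {
  const top = [...scored].sort((a, b) => b.score - a.score).slice(0, k);
  if (top.length === 0) return [];
  const spread = top[0].score - top[top.length - 1].score;
  return spread <= margin ? top : [top[0]];
}

// Tightly clustered scores: all three are offered.
const clustered = [
  { answer: "Pricebook Utilization", score: 0.91 },
  { answer: "Correct Batch Status", score: 0.9 },
  { answer: "Hypothetical third approach", score: 0.89 },
  { answer: "Unrelated entry", score: 0.5 },
];
console.log(chooseCandidates(clustered).map((c) => c.answer));

// One clear winner: only the best hit is returned.
const clear = [
  { answer: "Pricebook Utilization", score: 0.95 },
  { answer: "Correct Batch Status", score: 0.8 },
  { answer: "Hypothetical third approach", score: 0.6 },
];
console.log(chooseCandidates(clear).map((c) => c.answer));
```

The returned candidates can either be shown to the worker as options or merged into a learner prompt for a completion call.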

UPDATE: BTW, in your current approach, does your prompt instruct your fine-tuned model to say it doesn’t know when the answer cannot be deduced from the stated context? Example …

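// `paragraph` is an array of context strings and `question` is the
// worker's query without a trailing "?"; both are defined elsewhere.
const prompt =
  "Answer the following question from the context. If the answer cannot be deduced from the context, say 'I don't know':\n\n" +
  "Context:\n" +
  paragraph.join("\n\n") +
  "\n\nQuestion:\n" +
  question +
  "?" +
  "\n\nAnswer:";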