Training OpenAI on a private dataset

Hi,

I have a dataset that contains product details for thousands of different products. I want to train a model to be able to answer product questions such as “What’s the most reviewed gray bottle we have?”.

I know I can do this with a ton of prompts/completions, but if I have thousands of different products & I want it to be able to answer everything about these products, wouldn’t I have to give the model 1000+ prompts?

Instead is there a way I can give it a few hundred prompt/completion pairs & then give it this dataset to reference to answer product questions? The only way I see the model being able to answer these product questions accurately is if I gave it several prompt/completion pairs for each of the several thousand products in the dataset.

I feel like there has to be an easier way to teach the model about this data. As a simple test, I gave ChatGPT a portion of the CSV and I asked it to answer questions about the CSV and it did so successfully. If it can do this successfully, there has to be a way to properly train the model in this way.

ChatGPT is an application built by OpenAI on top of its own API.

For people who want capabilities different from ChatGPT's, the obvious choice is to build your own ChatGPT-like application using the OpenAI API.

The OpenAI neural network is proprietary, and its training dataset is controlled by OpenAI. That dataset cost millions of dollars to develop, so OpenAI is not going to make it available for others to modify unless they are willing to pay a very large amount of money.

You cannot train ChatGPT (without the permission of OpenAI), with the exception of fine-tuning. If OpenAI permitted people to retrain the model, it would create huge security, information-warfare, bias, and other problems which OpenAI is not going to allow.

Maybe in the future, OpenAI will create instances of its network which users can train, but that will be very expensive, because the OpenAI NN is huge and costs a fortune to run and maintain.

It’s all about money.

ChatGPT is, for the most part, a beta demo, released to demonstrate capabilities (now and future) and to help OpenAI secure more funding.

I see this same type of question asked repeatedly here.

If you want advanced or specific OpenAI capabilities, your choice is to develop your own application using the OpenAI API.

HTH

See Also:

https://beta.openai.com/docs/guides/production-best-practices/building-your-prototype

This is a recurring question.

I’m still developing my understanding here, but to start with you will want to use the OpenAI API.

This does not give you access to the ChatGPT model itself, but the latest models I have tested are still good (although not quite as good).

Then, for your use case, you are going to want to look at either fine-tuning or embeddings to get your dataset ingested.

This is where I am still learning which is best: fine-tuning or embeddings.

Yes, I am developing my own application using the OpenAI API; this question is to better understand how to do it. I am fine-tuning the model, but I am still running into the problem that I described above. Using the OpenAI API, are you aware of any way to not only train with prompt/completion pairs but also give the model access to the raw data it would need to answer questions similar to those it was trained on?

Hey Paul,

Sorry, I wasn’t aware this was a recurring question. I did search for it, but I only found one answer, and it was a bit unresolved.

I’ve looked at embeddings & it seems great for semantic search, but it’s not what I’m looking for. Fine-tuning the model will absolutely work, but from reading the documentation & testing it out, it has the limitation I pointed out in my question. It seems like I would have to fine-tune it with 10k or even 100k prompt/completion pairs to teach it this new dataset.

Is there really no better way to do this? If it could be trained on both the prompt/completion pairs along with giving it access to the dataset to pull new information from, I believe it could be trained on a few hundred prompt/completion pairs instead.
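For what it’s worth, the embeddings approach usually replaces the "teach the model the catalogue" step entirely: each product row is embedded once, and at question time you embed the question and pull the nearest rows into the prompt. Here is a minimal sketch of the nearest-neighbour lookup; the tiny three-dimensional vectors and product strings are made-up stand-ins (in a real pipeline each row's text and the question would be embedded with the OpenAI embeddings endpoint and the vectors would have far more dimensions):

```python
import math

# Toy stand-in for stored embeddings: product-row text -> vector.
# Assumption: in practice these would come from the embeddings API.
PRODUCT_VECTORS = {
    "gray bottle, 1200 reviews": [0.9, 0.1, 0.0],
    "blue mug, 40 reviews":      [0.1, 0.8, 0.1],
    "gray flask, 300 reviews":   [0.7, 0.2, 0.1],
}

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_matches(query_vector, k=2):
    """Return the k product rows most similar to the query embedding."""
    ranked = sorted(PRODUCT_VECTORS.items(),
                    key=lambda item: cosine(query_vector, item[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]

# A question like "most reviewed gray bottle" would embed near the gray items;
# the matched rows are then pasted into the prompt as context.
context_rows = top_matches([0.85, 0.1, 0.05])
```

The point is that no fine-tuning examples per product are needed: the model only ever sees the handful of rows retrieved for each question.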

You can only fine-tune as an OpenAI API user (at this point in time).

In the future (my best guess today), there will be an Azure API where users/customers can pre-train with their own model.

See also, on YT today:

Fine-tuning for Q/A is considered deprecated by the OpenAI cookbook:

It now redirects you to the embedding technique at:

Now, the embedding technique does involve more work on your end to make it work (you need compute plus a database).

If you still want to explore the fine-tuned option, here it is:

There is a cookbook!

Yes! There are several of them => https://cookbook.openai.com/

Hi, @squitorio
Did you find an answer to your question? I have the same issue right now. There should be a way to teach the model without providing thousands of prompt:completion pairs.
Please let me know if you’ve found anything on this issue. Maybe we can work something out together :)

The solution is to decouple your function; don’t over-rely on GPT.
Use a search engine to inject the product information into the prompt.

It’s cheaper, faster, and easier.
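A rough sketch of that "search then inject" pattern, with a naive keyword search standing in for a real search engine (the product rows and the prompt wording are illustrative assumptions, not any particular product's API):

```python
# Product catalogue rows; assumption: these would come from your database.
PRODUCTS = [
    "SKU-1 | gray bottle | 1200 reviews | $9.99",
    "SKU-2 | blue mug | 40 reviews | $7.49",
    "SKU-3 | gray flask | 300 reviews | $12.00",
]

def search(question):
    """Naive keyword match; a real app would use a search engine or embeddings."""
    terms = [t.strip("?.,!").lower() for t in question.split()]
    return [row for row in PRODUCTS
            if any(term in row.lower() for term in terms)]

def build_prompt(question):
    """Inject only the matching rows into the prompt sent to the model."""
    context = "\n".join(search(question))
    return (
        "Answer using only the product data below.\n\n"
        f"{context}\n\n"
        f"Question: {question}"
    )

prompt = build_prompt("What is the most reviewed gray bottle?")
```

Because the prompt carries the relevant rows, the model never needs to have been trained on the full catalogue, which is why this is cheaper and faster than fine-tuning.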

A new API has now been released. It takes AI and company data and turns them into learning models. I am in… Traction is coming up!

Hi: I am also interested in this topic. A lot of businesses would want a ChatGPT-like interface to respond to queries from their own dataset/knowledge base.

With gpt-3.5-turbo, there does not yet seem to be a way to do this. Fine-tuning only supports ada, davinci, etc.

I don’t think fine-tuning is a good fit for training on your own data. I think you need to use embeddings for it.

What about this:

  • you first ask GPT to translate the request into a query for your database, providing the necessary information such as schema and format (of course in a way that doesn’t allow SQL injection or involuntary disclosure, e.g. you only ask for the parameters, not the full query)
  • you query the database to retrieve the context
  • you ask GPT the question by also providing the context from the query result

EXAMPLE
I have a database with Hotels and I want to ask GPT the following “Which hotels are located in Rome? Provide name and address.”

Request to OpenAI

I have the following database:
create table hotel
(
    id INT      not null,
    name        TEXT not null,
    address     TEXT not null,
    city        TEXT not null,
    country     TEXT not null,
    latitude    DOUBLE PRECISION,
    longitude   DOUBLE PRECISION
);

Generate the query to answer the following question:

Which hotels are located in Rome? Provide name and address.

Only provide the code.

Answer from OpenAI

SELECT name, address 
FROM hotel 
WHERE city = 'Rome';

I submit the query above to my database and retrieve the response, then:

Request to OpenAI

The query returned the following in JSON format:
[
  {
    "name": "Acropoli",
    "address": "Via Principe Amedeo, 63-65"
  },
  {
    "name": "Hotel The Brand",
    "address": "Via Degli Aldobrandeschi 295"
  },
  {
    "name": "Alius Hotel",
    "address": "via Nazionale, 243"
  },
  {
    "name": "All Time Relais & Sport Hotel",
    "address": "Via Don Pasquino Borghi 100"
  },
  {
    "name": "Amalia Vaticano Hotel",
    "address": "Via Germanico 66"
  },
  {
    "name": "Hotel American Palace Eur",
    "address": "Via Laurentina, 554"
  },
  {
    "name": "Antico Palazzo Rospigliosi",
    "address": "Via Liberiana 21"
  },
  {
    "name": "Raeli Hotel Archimede",
    "address": "Via Dei Mille, 19"
  },
  {
    "name": "Best Western Ars Hotel",
    "address": "Via Monte Altissimo, 20-24"
  },
  {
    "name": "Hotel Astoria Garden",
    "address": "Via Vittorio Bachelet, 8"
  }
]

Use this data to answer the following question:

Which hotels are located in Rome? Provide name and address.

Response from OpenAI

Acropoli - Via Principe Amedeo, 63-65
Hotel The Brand - Via Degli Aldobrandeschi 295
Alius Hotel - via Nazionale, 243
All Time Relais & Sport Hotel - Via Don Pasquino Borghi 100
Amalia Vaticano Hotel - Via Germanico 66
Hotel American Palace Eur - Via Laurentina, 554
Antico Palazzo Rospigliosi - Via Liberiana 21
Raeli Hotel Archimede - Via Dei Mille, 19
Best Western Ars Hotel - Via Monte Altissimo, 20-24
Hotel Astoria Garden - Via Vittorio Bachelet, 8
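The two round trips above can be wired together in a few lines. In this sketch the model’s first reply is stubbed in as a canned SQL string so the plumbing is visible without an API key; the sqlite schema is a trimmed version of the one in the example:

```python
import json
import sqlite3

# Build a tiny in-memory copy of the hotel table from the example.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE hotel (
    id INTEGER NOT NULL, name TEXT NOT NULL, address TEXT NOT NULL,
    city TEXT NOT NULL, country TEXT NOT NULL)""")
conn.executemany(
    "INSERT INTO hotel VALUES (?, ?, ?, ?, ?)",
    [(1, "Acropoli", "Via Principe Amedeo, 63-65", "Rome", "IT"),
     (2, "Hotel Milano", "Via Torino 1", "Milan", "IT")])

# Round trip 1: the model translates the question into SQL.
# (Canned here; in practice this is the model's reply to the first prompt.)
generated_sql = "SELECT name, address FROM hotel WHERE city = 'Rome';"

# Run the query and serialise the rows as JSON for the second prompt.
rows = [{"name": n, "address": a} for n, a in conn.execute(generated_sql)]
context = json.dumps(rows, indent=2)

# Round trip 2: hand the JSON back together with the original question.
second_prompt = (
    f"The query returned the following in JSON format:\n{context}\n\n"
    "Use this data to answer the following question:\n\n"
    "Which hotels are located in Rome? Provide name and address."
)
```

The model’s final answer then only has to reformat the JSON context, which is a task it handles reliably.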

Not a good idea, IMO. Can you really trust an AI that has no idea what it’s saying not to ever write a query that is so poorly performing that it will bring down your database? Or to expose confidential data to people who shouldn’t have it? Or to include an insert statement and not just a simple select?

The example is simplified. As mentioned in my post, in practice I would have it generate only parts of the query and perform some validation on it. The important thing is to translate the human prompt into a set of keys and values to search, not the full query itself.
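One way to sketch that safer variant: ask the model to return search parameters as JSON, validate them against a column whitelist, and build a parameterized query yourself, so model output never reaches the database as raw SQL. The allowed-columns set and the canned model reply below are illustrative assumptions:

```python
import json

# Columns the application permits as filters (assumption for this sketch).
ALLOWED_COLUMNS = {"city", "country", "name"}

def build_query(model_reply_json):
    """Turn a JSON dict of filters from the model into a safe, parameterized
    SQL statement. Raises ValueError on any non-whitelisted column."""
    params = json.loads(model_reply_json)
    unknown = set(params) - ALLOWED_COLUMNS
    if unknown:
        raise ValueError(f"disallowed filter columns: {sorted(unknown)}")
    cols = sorted(params)
    where = " AND ".join(f"{col} = ?" for col in cols)
    sql = f"SELECT name, address FROM hotel WHERE {where}"
    return sql, [params[col] for col in cols]

# Hypothetical model reply for "Which hotels are located in Rome?"
sql, args = build_query('{"city": "Rome"}')
```

Since the SQL text is assembled from the whitelist and placeholders, a malicious or confused model reply can at worst pick the wrong filter value, not drop a table or leak other columns.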

You can look at oHandle for this exact use case. DM me if you want more details.

We also ran a prompt injection bounty challenge inviting people to shake things up.

Does employing OpenAI training against a private dataset then allow OpenAI/ChatGPT access to that dataset? In other words, once you upload your private data, is it then available for OpenAI or anyone else to consume or train against?
