Training OpenAI on a private dataset

Hi,

I have a dataset that contains product details for thousands of different products. I want to train a model to be able to answer product questions such as “What’s the most reviewed gray bottle we have?”.

I know I can do this with a ton of prompts/completions, but if I have thousands of different products & I want it to be able to answer everything about these products, wouldn’t I have to give the model 1000+ prompts?

Instead is there a way I can give it a few hundred prompt/completion pairs & then give it this dataset to reference to answer product questions? The only way I see the model being able to answer these product questions accurately is if I gave it several prompt/completion pairs for each of the several thousand products in the dataset.

I feel like there has to be an easier way to teach the model about this data. As a simple test, I gave ChatGPT a portion of the CSV and I asked it to answer questions about the CSV and it did so successfully. If it can do this successfully, there has to be a way to properly train the model in this way.

ChatGPT is an application built by OpenAI on top of its own API.

For people who want capabilities different from ChatGPT's, the obvious choice is to build your own ChatGPT-like application using the OpenAI API.

The OpenAI neural network is proprietary, and its training dataset is controlled by OpenAI. That dataset cost millions of dollars to develop, so OpenAI is not going to make it available for others to modify unless they are willing to pay a very large amount of money.

You cannot train ChatGPT (without the permission of OpenAI), with the exception of fine-tuning. If OpenAI permitted people to retrain the model, it would create huge security, information-warfare, bias, and other problems which OpenAI is not going to allow.

Maybe in the future, OpenAI will create instances of its network which users can train, but that will be very expensive, because the OpenAI NN is huge and costs a fortune to run and maintain.

It’s all about money.

ChatGPT is, for the most part, a beta demo, released to demonstrate capabilities (now and future) and to help OpenAI secure more funding.

I see this same type of question asked repeatedly here.

If you want advanced or specific OpenAI capabilities, your choice is to develop your own application using the OpenAI API.

HTH

See Also:

https://beta.openai.com/docs/guides/production-best-practices/building-your-prototype

This is a recurring question.

I’m still developing my understanding here, but to start with you will want to use the OpenAI API.

This does not give you access to the ChatGPT model itself, but the latest models I have tested are still good (although not quite as good).

Then, for your use case, you are going to want to look at either fine-tuning or embeddings to get your dataset ingested.

This is where I am still learning which is best: fine-tuning or embeddings.

Yes, I am developing my own application using the OpenAI API; this question is to better understand how to do it. I am fine-tuning the model, but I am still running into the problem that I described above. Using the OpenAI API, are you aware of any way to not only train with prompt/completion pairs but also give the model access to the raw data it would need to answer questions similar to those it was trained on?

Hey Paul,

Sorry, I wasn’t aware this was a recurring question. I did search for it, but I only found one answer, and it was a bit unresolved.

I’ve looked at embeddings & it seems great for semantic search, but it’s not what I’m looking for. Fine-tuning the model will absolutely work, but from reading the documentation & testing it out, it has the limitation I pointed out in my question. It seems like I would have to fine-tune it with 10k or even 100k prompt/completion pairs to teach it this new dataset.

Is there really no better way to do this? If it could be trained on both the prompt/completion pairs along with giving it access to the dataset to pull new information from, I believe it could be trained on a few hundred prompt/completion pairs instead.
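For what it’s worth, the embeddings approach usually replaces the "teach the model the catalogue" step entirely: each product row is embedded once, and at question time you embed the question and pull the nearest rows into the prompt. Here is a minimal sketch of the nearest-neighbour lookup; the tiny three-dimensional vectors and product strings are made-up stand-ins (in a real pipeline each row's text and the question would be embedded with the OpenAI embeddings endpoint and the vectors would have far more dimensions):

```python
import math

# Toy stand-in for stored embeddings: product-row text -> vector.
# Assumption: in practice these would come from the embeddings API.
PRODUCT_VECTORS = {
    "gray bottle, 1200 reviews": [0.9, 0.1, 0.0],
    "blue mug, 40 reviews":      [0.1, 0.8, 0.1],
    "gray flask, 300 reviews":   [0.7, 0.2, 0.1],
}

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def top_matches(query_vector, k=2):
    """Return the k product rows most similar to the query embedding."""
    ranked = sorted(PRODUCT_VECTORS.items(),
                    key=lambda item: cosine(query_vector, item[1]),
                    reverse=True)
    return [text for text, _ in ranked[:k]]

# A question like "most reviewed gray bottle" would embed near the gray items;
# the matched rows are then pasted into the prompt as context.
context_rows = top_matches([0.85, 0.1, 0.05])
```

The point is that no fine-tuning examples per product are needed: the model only ever sees the handful of rows retrieved for each question.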

You can only fine-tune as an OpenAI API user (at this point in time).

In the future (my best guess today), there will be an Azure API where users/customers can pre-train with their own model.

See also, on YT today:

Fine-tuning for Q/A is considered deprecated by the OpenAI cookbook:

It now redirects you to the embedding technique at:

Now, the embedding technique does involve more work on your end to make it work (you need compute plus a database).

If you still want to explore the fine-tuned option, here it is:

There is a cookbook!

Yes! There are several of them => https://cookbook.openai.com/

Hi, @squitorio
Did you find an answer to your question? I have the same issue right now. There should be a way to teach the model without providing thousands of prompt:completion pairs.
Please let me know if you’ve found anything on this issue. Maybe we can work something out together :)

The solution is to decouple your function; don’t over-rely on GPT.
Use a search engine to inject the product information into the prompt.

It’s cheaper, faster, and easier.
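A rough sketch of that "search then inject" pattern, with a naive keyword search standing in for a real search engine (the product rows and the prompt wording are illustrative assumptions, not any particular product's API):

```python
# Product catalogue rows; assumption: these would come from your database.
PRODUCTS = [
    "SKU-1 | gray bottle | 1200 reviews | $9.99",
    "SKU-2 | blue mug | 40 reviews | $7.49",
    "SKU-3 | gray flask | 300 reviews | $12.00",
]

def search(question):
    """Naive keyword match; a real app would use a search engine or embeddings."""
    terms = [t.strip("?.,!").lower() for t in question.split()]
    return [row for row in PRODUCTS
            if any(term in row.lower() for term in terms)]

def build_prompt(question):
    """Inject only the matching rows into the prompt sent to the model."""
    context = "\n".join(search(question))
    return (
        "Answer using only the product data below.\n\n"
        f"{context}\n\n"
        f"Question: {question}"
    )

prompt = build_prompt("What is the most reviewed gray bottle?")
```

Because the prompt carries the relevant rows, the model never needs to have been trained on the full catalogue, which is why this is cheaper and faster than fine-tuning.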

A new API has now been released. It takes AI and company data and turns them into learning models. I am in… Traction is coming up!

Hi: I am also interested in this topic. A lot of businesses would want a ChatGPT-like interface to respond to queries from their own dataset/knowledge base.

With gpt-3.5-turbo, there does not yet seem to be a way to do this. Fine-tuning only supports ada, davinci, etc.

I don’t think fine-tuning is a good fit for training on your own data. I think you need to use embeddings for it.

What about this:

  • you first ask GPT to translate the request into a query for your database, providing the necessary information such as schema and format (of course in a way that doesn’t allow SQL injection or involuntary disclosure, e.g. you only ask for the parameters, not the full query)
  • you query the database to retrieve the context
  • you ask GPT the question by also providing the context from the query result

EXAMPLE
I have a database with Hotels and I want to ask GPT the following “Which hotels are located in Rome? Provide name and address.”

Request to OpenAI

I have the following database:
create table hotel
(
    id INT      not null,
    name        TEXT not null,
    address     TEXT not null,
    city        TEXT not null,
    country     TEXT not null,
    latitude    DOUBLE PRECISION,
    longitude   DOUBLE PRECISION
);

Generate the query to answer the following question:

Which hotels are located in Rome? Provide name and address.

Only provide the code.

Answer from OpenAI

SELECT name, address 
FROM hotel 
WHERE city = 'Rome';

I submit the query above to my database and retrieve the response, then:

Request to OpenAI

The query returned the following in JSON format:
[
  {
    "name": "Acropoli",
    "address": "Via Principe Amedeo, 63-65"
  },
  {
    "name": "Hotel The Brand",
    "address": "Via Degli Aldobrandeschi 295"
  },
  {
    "name": "Alius Hotel",
    "address": "via Nazionale, 243"
  },
  {
    "name": "All Time Relais & Sport Hotel",
    "address": "Via Don Pasquino Borghi 100"
  },
  {
    "name": "Amalia Vaticano Hotel",
    "address": "Via Germanico 66"
  },
  {
    "name": "Hotel American Palace Eur",
    "address": "Via Laurentina, 554"
  },
  {
    "name": "Antico Palazzo Rospigliosi",
    "address": "Via Liberiana 21"
  },
  {
    "name": "Raeli Hotel Archimede",
    "address": "Via Dei Mille, 19"
  },
  {
    "name": "Best Western Ars Hotel",
    "address": "Via Monte Altissimo, 20-24"
  },
  {
    "name": "Hotel Astoria Garden",
    "address": "Via Vittorio Bachelet, 8"
  }
]

Use this data to answer the following question:

Which hotels are located in Rome? Provide name and address.

Response from OpenAI

Acropoli - Via Principe Amedeo, 63-65
Hotel The Brand - Via Degli Aldobrandeschi 295
Alius Hotel - via Nazionale, 243
All Time Relais & Sport Hotel - Via Don Pasquino Borghi 100
Amalia Vaticano Hotel - Via Germanico 66
Hotel American Palace Eur - Via Laurentina, 554
Antico Palazzo Rospigliosi - Via Liberiana 21
Raeli Hotel Archimede - Via Dei Mille, 19
Best Western Ars Hotel - Via Monte Altissimo, 20-24
Hotel Astoria Garden - Via Vittorio Bachelet, 8
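The two round trips above can be wired together in a few lines. In this sketch the model’s first reply is stubbed in as a canned SQL string so the plumbing is visible without an API key; the sqlite schema is a trimmed version of the one in the example:

```python
import json
import sqlite3

# Build a tiny in-memory copy of the hotel table from the example.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE hotel (
    id INTEGER NOT NULL, name TEXT NOT NULL, address TEXT NOT NULL,
    city TEXT NOT NULL, country TEXT NOT NULL)""")
conn.executemany(
    "INSERT INTO hotel VALUES (?, ?, ?, ?, ?)",
    [(1, "Acropoli", "Via Principe Amedeo, 63-65", "Rome", "IT"),
     (2, "Hotel Milano", "Via Torino 1", "Milan", "IT")])

# Round trip 1: the model translates the question into SQL.
# (Canned here; in practice this is the model's reply to the first prompt.)
generated_sql = "SELECT name, address FROM hotel WHERE city = 'Rome';"

# Run the query and serialise the rows as JSON for the second prompt.
rows = [{"name": n, "address": a} for n, a in conn.execute(generated_sql)]
context = json.dumps(rows, indent=2)

# Round trip 2: hand the JSON back together with the original question.
second_prompt = (
    f"The query returned the following in JSON format:\n{context}\n\n"
    "Use this data to answer the following question:\n\n"
    "Which hotels are located in Rome? Provide name and address."
)
```

The model’s final answer then only has to reformat the JSON context, which is a task it handles reliably.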

Not a good idea, IMO. Can you really trust an AI that has no idea what it’s saying not to ever write a query that is so poorly performing that it will bring down your database? Or to expose confidential data to people who shouldn’t have it? Or to include an insert statement and not just a simple select?

The example is simplified. As mentioned in my post, in practice I would have it generate only parts of the query and perform some validation on it. The important thing is to translate the human prompt into a set of keys and values to search, not the full query itself.
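One way to sketch that safer variant: ask the model to return search parameters as JSON, validate them against a column whitelist, and build a parameterized query yourself, so model output never reaches the database as raw SQL. The allowed-columns set and the canned model reply below are illustrative assumptions:

```python
import json

# Columns the application permits as filters (assumption for this sketch).
ALLOWED_COLUMNS = {"city", "country", "name"}

def build_query(model_reply_json):
    """Turn a JSON dict of filters from the model into a safe, parameterized
    SQL statement. Raises ValueError on any non-whitelisted column."""
    params = json.loads(model_reply_json)
    unknown = set(params) - ALLOWED_COLUMNS
    if unknown:
        raise ValueError(f"disallowed filter columns: {sorted(unknown)}")
    cols = sorted(params)
    where = " AND ".join(f"{col} = ?" for col in cols)
    sql = f"SELECT name, address FROM hotel WHERE {where}"
    return sql, [params[col] for col in cols]

# Hypothetical model reply for "Which hotels are located in Rome?"
sql, args = build_query('{"city": "Rome"}')
```

Since the SQL text is assembled from the whitelist and placeholders, a malicious or confused model reply can at worst pick the wrong filter value, not drop a table or leak other columns.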

You can look at oHandle for this exact use case. DM me if you want more details.

We also ran a prompt injection bounty challenge inviting people to shake things up.

Does employing OpenAI training against a private dataset then allow OpenAI/ChatGPT access to that dataset? In other words, once you upload your private data, is it then available for OpenAI or anyone else to consume or train against?
