Train ChatGPT on a confidential dataset

I would like to fine-tune ChatGPT to design a solution for my clients. Is there an option to keep the data confidential? Thanks.

2 Likes

Fine-tuning ChatGPT is not possible. Neither is it currently possible to fine-tune the next-best model, text-davinci-003. The best option you have is the base davinci model, and good luck fine-tuning that without going totally insane. Yikes.

2 Likes

Thank you, this partly answers my question. However, for the hypothetical scenario where we would fine-tune using your guide (https://beta.openai.com/docs/guides/fine-tuning), I'd really like to know the extent to which the data we upload for fine-tuning is used. In other words, is there a risk this data can leak? Could you use it, or sell it on? I believe having legal clarity around this would greatly enhance the possibility of more vertical applications, and many would be interested. So to sum up: how safe is the confidential data that I upload to train the model on? Thank you.
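For reference, a rough sketch of what that upload involves, assuming the legacy JSONL prompt/completion format from that guide (the file name, example text, and API key are placeholders, and this uses the 0.x-era openai Python package):

```python
# Minimal sketch of the legacy fine-tuning upload flow. Every
# prompt/completion pair in the JSONL file is transmitted to OpenAI's
# servers, which is exactly the data-handling question raised above.
import json
import openai

openai.api_key = "sk-..."  # placeholder API key

# Training data in the prompt/completion JSONL format from the guide
examples = [
    {"prompt": "Client question ->", "completion": " answer text.\n"},
]
with open("train.jsonl", "w") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")

# Both calls send the file contents to OpenAI
upload = openai.File.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
job = openai.FineTune.create(training_file=upload["id"], model="davinci")
```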

Hey @georgejs

I think the OpenAI document referenced below clears everything up. The short answer is that you can contact OpenAI and request that they not use your data internally (or externally, I assume).

See:

How your data is used to improve model performance

1 Like

One way to keep the information confidential (to some degree) is to store the data locally. By using embeddings, you will only send up the small pieces of information required to answer a specific question.

If the issue is people’s names, you might be able to sanitize the data by replacing names (automatically) before you do the training.
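A minimal sketch of that kind of automatic replacement (regex-based here, with a hypothetical names list; a proper NER model such as spaCy would be more robust for real data):

```python
# Replace known names with placeholders before any text leaves your machine.
# KNOWN_NAMES is hypothetical; in practice you'd use NER rather than a
# hand-maintained list.
import re

KNOWN_NAMES = ["Alice Smith", "Bob Jones"]  # hypothetical

def sanitize(text: str) -> str:
    for i, name in enumerate(KNOWN_NAMES):
        text = re.sub(re.escape(name), f"PERSON_{i}", text)
    return text

print(sanitize("Alice Smith emailed Bob Jones."))  # PERSON_0 emailed PERSON_1.
```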

If the confidentiality is related to knowledge or IP, you have to weigh up whether small snippets taken out of context will cause you issues, or whether they need to be in the larger surrounding text to make sense. If this is not a problem, embedding is also a good solution (for the reason described above).
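As a concrete sketch of that flow (the embedding model, chunk contents, and chunking strategy are assumptions, and this again uses the 0.x-era openai package): only the single best-matching chunk is sent alongside the user's question at answer time.

```python
# Minimal sketch of the embedding approach: index chunks locally, then send
# only the best-matching chunk with the user's question.
import numpy as np
import openai

openai.api_key = "sk-..."  # placeholder

chunks = ["Refunds are processed within 14 days.", "Support hours are 9-5 CET."]

def embed(text: str) -> np.ndarray:
    resp = openai.Embedding.create(model="text-embedding-ada-002", input=text)
    return np.array(resp["data"][0]["embedding"])

# One-time indexing: note each chunk is still sent to the API once here
index = [(chunk, embed(chunk)) for chunk in chunks]

def answer(question: str) -> str:
    q = embed(question)
    # ada-002 vectors are unit-length, so a dot product is cosine similarity
    best_chunk, _ = max(index, key=lambda pair: float(np.dot(q, pair[1])))
    prompt = f"Context: {best_chunk}\n\nQuestion: {question}\nAnswer:"
    resp = openai.Completion.create(model="text-davinci-003", prompt=prompt, max_tokens=100)
    return resp["choices"][0]["text"].strip()
```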

1 Like

Thank you so much, @raymonddavey and @DutytoDevelop. This is very helpful.
In terms of then actually using the fine-tuned model via the API: how safe/confidential are the "chats"?

1 Like

But to create the embeddings in the first place, wouldn’t you have to effectively send all your data to OpenAI in order to get back the vectors? Sure, later on when a user sends a query and you append context you’ll only be sending the relevant chunk(s), but you’ve already sent all your data to OpenAI, right? Or am I missing something?

You are correct. You will have to send your info to do the embedding, and you would have to rely on OpenAI not to use your data.

If this is a problem, your only other choice would be to do the embedding offline using something similar to word2vec, but for sentences.
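For example, a minimal sketch using the sentence-transformers library (one common choice for local sentence-level embeddings; the library and model name are my assumption, not something named above):

```python
# Fully offline sentence embeddings: no document text leaves your machine
# once the model weights are downloaded.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # runs locally

docs = ["Refunds are processed within 14 days.", "Support hours are 9-5 CET."]
doc_vecs = model.encode(docs, convert_to_tensor=True)

query_vec = model.encode("How long do refunds take?", convert_to_tensor=True)
scores = util.cos_sim(query_vec, doc_vecs)[0]
print(docs[int(scores.argmax())])  # best-matching local chunk
```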

Thanks. How would you go about contacting OpenAI if you are aiming to develop a solution for a larger corporate client, where data must be kept siloed?

I guess you could try their support email address. Also, I don't know if you are aware, but they changed their policy from opt-out to opt-in for using your data for training.

Basically, they do not use your data unless you allow them to.

The exception is the ChatGPT app itself, where they have said they keep the history for 30 days (I think) for your use and then delete it. But they don't use that for training either.

Best to contact them via support for more specifics.

1 Like

@georgejs are you still looking for a solution? We are solving this problem at PlausibleAI. Please contact me at contact.plausibleai@gmail.com with your Calendly.

Take a look at the screenshot, which is from the documentation main page: OpenAI Platform