Train ChatGPT on a confidential dataset

georgejs · January 15, 2023, 4:26pm

I would like to fine-tune ChatGPT to design a solution for my clients. Is there an option to keep the data confidential? Thanks

DutytoDevelop · January 15, 2023, 4:42pm

drinkingteddy · January 15, 2023, 9:15pm

Fine-tuning ChatGPT is not possible. Neither is it possible to fine-tune the next best model, text-davinci-003 (currently). The best option you have is the base Davinci model - and good luck fine-tuning that and not going totes insane. Yikes.

georgejs · January 26, 2023, 9:04pm

Thank you, this partly answers my question. However, for the hypothetical scenario where we fine-tune using your guide (https://beta.openai.com/docs/guides/fine-tuning), I’d really like to know the extent to which the data we upload for fine-tuning is used. In other words, is there a risk this data can leak? Can you sell it on? I believe legal clarity around this would greatly enhance the possibility of more vertical applications, and many would be interested. So to sum up: how safe is the confidential data that I upload to train the model on? Thank you

ruby_coder · January 27, 2023, 12:51am

Hey @georgejs

I think the OpenAI document referenced below clears everything up. The short answer is that you can contact OpenAI and request that they not use your data internally (or externally, I assume).

See:

How your data is used to improve model performance

raymonddavey · January 27, 2023, 5:33am

One way to keep the information confidential (to some degree) is to store the data locally. By using embeddings, you only send up the small pieces of information required to answer a specific question.

If the issue is people’s names, you might be able to sanitize the data by replacing names (automatically) before you do the training.
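A minimal sketch of that kind of automatic name replacement, assuming you already have a list of the names to hide. The function names `sanitize_names` and `restore_names` and the `PERSON_n` placeholder format are made up for illustration. The mapping stays on your own machine, so you can put the names back into a response later.

```python
import re

def sanitize_names(text, names):
    """Replace each known name with a placeholder such as PERSON_1.

    Returns the sanitized text plus the name-to-placeholder mapping.
    Assumption: the caller supplies the list of names to hide.
    """
    mapping = {}
    # Longest names first, so a full name is replaced before a first name alone.
    for name in sorted(set(names), key=lambda n: (-len(n), n)):
        placeholder = f"PERSON_{len(mapping) + 1}"
        pattern = re.compile(r"\b" + re.escape(name) + r"\b", re.IGNORECASE)
        text, count = pattern.subn(placeholder, text)
        if count:
            mapping[name] = placeholder
    return text, mapping

def restore_names(text, mapping):
    """Put the original names back, e.g. in text returned by the model."""
    # Longest placeholders first, so PERSON_10 is not clobbered by PERSON_1.
    for name, placeholder in sorted(mapping.items(), key=lambda kv: -len(kv[1])):
        text = text.replace(placeholder, name)
    return text

if __name__ == "__main__":
    clean, mapping = sanitize_names(
        "Alice Smith emailed Bob about Bob's contract.",
        ["Alice Smith", "Bob"],
    )
    print(clean)
    print(restore_names(clean, mapping))
```

This only catches names you list up front. It will not find names it has never been told about, so check the output before uploading anything.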

If the confidentiality is related to knowledge or IP, you have to weigh up whether small snippets taken out of context will cause you issues - or whether they need to be in the larger surrounding text to make sense. If this is not a problem, embedding is also a good solution (for the reason described above).
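To make the "only send the small pieces" idea concrete, here is a rough sketch of the query-time step. It assumes the chunk vectors have already been computed somehow and are stored locally as plain lists of floats. The names `top_chunks` and `build_prompt`, and the prompt wording, are illustrative only and not part of any OpenAI API.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_chunks(query_vec, store, k=2):
    """Return the k chunk texts most similar to the query.

    store is a list of (chunk_text, vector) pairs kept on your own machine.
    """
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

def build_prompt(question, chunks):
    """Only the selected chunks are placed in the text sent to the model."""
    context = "\n---\n".join(chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"

if __name__ == "__main__":
    # Toy two-dimensional vectors; real embeddings have many more dimensions.
    store = [
        ("pricing terms", [1.0, 0.0]),
        ("office address", [0.0, 1.0]),
        ("discount policy", [0.9, 0.1]),
    ]
    chunks = top_chunks([1.0, 0.0], store, k=2)
    print(build_prompt("What discounts apply?", chunks))
```

The rest of the store never appears in the prompt, which is the trade-off described above: the model only sees snippets, so they have to make sense out of context.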

georgejs · January 27, 2023, 11:20am

Thank you so much, @raymonddavey @DutytoDevelop. This is very helpful.
In terms of actually using the fine-tuned model via the API: how safe / confidential are the “chats”?

hermeticJay · May 8, 2023, 12:16am

But to create the embeddings in the first place, wouldn’t you have to effectively send all your data to OpenAI in order to get back the vectors? Sure, later on when a user sends a query and you append context you’ll only be sending the relevant chunk(s), but you’ve already sent all your data to OpenAI, right? Or am I missing something?

raymonddavey · May 8, 2023, 12:58am

You are correct. You will have to send your info to do the embedding, and you would have to rely on OpenAI not to use your data.

If this is a problem, your only other choice would be to do the embedding offline using something similar to word2vec - but for sentences.
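As a very rough illustration of what "embedding offline" means, here is a sketch that turns a sentence into a vector entirely on your own machine, using only the standard library. To be clear, this is a hashed bag-of-words, which is much cruder than word2vec or a proper sentence-embedding model: it only captures shared words, not meaning. The name `local_embed` and the dimension of 256 are arbitrary choices for the example.

```python
import hashlib
import math
import re

def local_embed(text, dim=256):
    """Map text to a unit-length vector without sending it anywhere.

    Each word is hashed to one of `dim` buckets and counted. This is a
    stand-in for a real locally run embedding model, not a replacement.
    """
    vec = [0.0] * dim
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        index = int.from_bytes(digest[:4], "big") % dim
        vec[index] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec

def similarity(a, b):
    """Dot product; equals cosine similarity for unit-length vectors."""
    return sum(x * y for x, y in zip(a, b))

if __name__ == "__main__":
    doc = local_embed("contract renewal terms")
    query = local_embed("renewal terms for the contract")
    print(similarity(doc, query))
```

If you go this route, you would swap `local_embed` for a model you run yourself. The point is only that the vectors and the source text never leave your machine.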

georgejs · May 9, 2023, 7:46pm

Thanks. How would you go about contacting OpenAI if you are aiming to develop a solution for a larger corporate client, where data must be kept siloed?

raymonddavey · May 9, 2023, 8:31pm

I guess you could try their support email address. Also, I don’t know if you are aware, but they changed their policy from opt-out to opt-in for using your data for training.

Basically, they do not use your data unless you allow them to.

The exception is the ChatGPT app itself, where they have said they keep the history for 30 days (I think) for your use and then delete it. But they don’t use that for training either.

Best to contact them via support for more specifics.

DataPrivacyPreserved · July 26, 2023, 2:32pm

@georgejs are you still looking for a solution? We are solving this problem at PlausibleAI. Please contact me with your Calendly at contact.plausibleai@gmail.com

vb · July 26, 2023, 4:30pm

Take a look at the screenshot which is from the documentation main page: OpenAI Platform
