API - Is our data really "ours"? Major Concern in Data Processing Addendum

OpenAI clearly says that “We do not train on your business data (data from ChatGPT Team, ChatGPT Enterprise, or our API Platform)”, and that “you own your inputs and outputs”.

However, the Data Processing Addendum at https://openai.com/policies/data-processing-addendum/, which governs data processing for the API and ChatGPT Enterprise, contains a clause that says:

For clarity, OpenAI may continue to process information derived from Customer Data that has been deidentified, anonymized, and/or aggregated such that the data is no longer considered Personal Data under applicable Data Protection Laws and in a manner that does not identify individuals or Customer to improve OpenAI’s systems and services.

So, OpenAI apparently DOES use data sent to the API after deidentifying it, and this deidentified data can be used to improve their systems, which could cause a lot of problems from a legal perspective. Has anyone had to work through this issue?


IANAL obviously, but I think that clause pertains mostly to things like geographical usage data and similar aggregate statistics.

However, it’s understandable that this vague language raises concerns about opening a lot of unwanted doors. If customer data privacy is a high priority (as it rightly should be), then I’d recommend taking a look at the Azure offerings for your API needs 🙂


I’d guess lots of people will soon be dealing with this. I agree with looking for an alternative option, especially where legal exposure and proprietary information are involved. Right now there are so many developers pumping their employers’ data and users’ personal chats into the OpenAI system without understanding the reality of what they are doing that there is sure to be a fallout at some point.


Why would anyone actually believe any tech company that says it won’t milk as much value out of your data as possible? Of course they will. You can ignore whatever their “agreements” say: if you plan to keep any of your data genuinely private, don’t send it out over the web to one of these cloud companies. The only truly “private” LLMs are the ones you run locally.


What makes it so challenging to run locally without any cloud infrastructure is the size of the models. A Llama 3 70B, which is good but can’t be compared to GPT-4o, would require about 280GB for inference at full 32-bit precision (70B parameters × 4 bytes); even aggressive quantization still demands serious hardware. Not using the cloud is a death sentence for many applications.

I agree there’s no economical way to run local LLMs at that scale. However, I wonder if some form of obfuscation/anonymization could be used so that the data sent to a cloud LLM is still usable but is anonymized before it ever goes over the network. For example, you could have “John Doe” sent to the LLM as “Joe Fox”. You’d then be protecting your private information by feeding the LLM substitute data (rather than trusting them to anonymize it), while still getting results you can transform back into useful output.

Of course, for words whose location in semantic vector space is significant, you can’t anonymize like that. But you can at least anonymize things like phone numbers, emails, company names, and people’s names in a way that lets you “unscramble” the output and rematch it to the correct info when you get the results.
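As a minimal sketch of that scramble/unscramble idea: the class below swaps regex-detectable PII (emails and phone numbers only; reliably catching people’s and company names would need a real NER model, which is out of scope here) for reversible placeholders before the text leaves your machine, then maps the placeholders back afterwards. All names and patterns are illustrative assumptions, not any particular library’s API.

```python
import re

class Pseudonymizer:
    """Replace easily-detected PII with reversible placeholders
    before sending text to a cloud LLM, then restore it afterwards."""

    def __init__(self):
        self.reverse = {}   # placeholder -> original value
        self.forward = {}   # original value -> placeholder
        self.counters = {}  # per-kind placeholder counter

    def _placeholder(self, kind, value):
        # Reuse the same placeholder if the value was seen before.
        if value not in self.forward:
            n = self.counters.get(kind, 0) + 1
            self.counters[kind] = n
            ph = f"{kind}_{n}"
            self.forward[value] = ph
            self.reverse[ph] = value
        return self.forward[value]

    def scramble(self, text):
        # Emails first, so their digits aren't half-eaten by the phone pattern.
        text = re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+",
                      lambda m: self._placeholder("EMAIL", m.group()), text)
        # Rough phone pattern: optional +, then 9+ digits with separators.
        text = re.sub(r"\+?\d[\d\s().-]{7,}\d",
                      lambda m: self._placeholder("PHONE", m.group()), text)
        return text

    def unscramble(self, text):
        # Longest placeholders first so PHONE_10 isn't clobbered by PHONE_1.
        for ph in sorted(self.reverse, key=len, reverse=True):
            text = text.replace(ph, self.reverse[ph])
        return text
```

Usage would look like `masked = p.scramble(prompt)`, send `masked` to the API, then `p.unscramble(reply)` on the response. The weak point is exactly the one noted above: anything the model needs to reason about semantically (job titles, addresses as locations, names with cultural context) can’t be swapped blindly without degrading the answer.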


“Privacy is dead! Long-Live Privacy!”