I’m concerned about exposing client data when using GPT-3. In other words, I do not want to upload training data or post prompts to OpenAI’s API that contain any client information. There are many ways to identify numbers, codes, email addresses, names and locations using regex and small NER models, but I’ll bet many others have already faced this issue and come up with good solutions, so I thought I’d quickly ask on the forum.
For example, I want to use GPT-3 to analyze text such as “Mr. John Brown alleges homeless people started a fire in his vacant home somewhere between 9 and 12 December 2021.” I don’t want to send the name “John Brown” to the OpenAI completion endpoint for analysis; I’d rather replace it with a pseudonym or blank it out completely.
Does anybody know of an existing tool that removes or blanks out sensitive data from text?
It sounds like a challenge, since you need to filter the text before sending it to the Engine. That will require either using another engine, possibly an offline one, to analyze and change the text, or building filters by hand and implementing them in your code.
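As a rough illustration of the hand-built filter route, here is a minimal Python sketch that uses only the standard library. The two patterns, the placeholder format, and the function names `redact` and `restore` are my own assumptions for illustration, not an existing tool. A title-based name pattern like this will miss many real names, so a small NER model would still be needed for anything serious.

```python
import re

# Hypothetical patterns for illustration only; real data needs a broader set.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
# Group 1 keeps the title ("Mr. "), group 2 is the capitalized name after it.
TITLED_NAME_RE = re.compile(r"\b((?:Mr|Mrs|Ms|Dr)\.?\s+)([A-Z][a-z]+(?:\s+[A-Z][a-z]+)*)")


def redact(text):
    """Replace emails and titled names with numbered placeholders.

    Returns the redacted text and a placeholder -> original mapping,
    which stays on your side and is never sent anywhere.
    """
    mapping = {}  # placeholder -> original value
    seen = {}     # original value -> placeholder, so repeats reuse one tag
    counts = {}   # per-kind counter used to number the placeholders

    def tag_for(kind, value):
        if value not in seen:
            counts[kind] = counts.get(kind, 0) + 1
            tag = f"[{kind}_{counts[kind]}]"
            seen[value] = tag
            mapping[tag] = value
        return seen[value]

    text = EMAIL_RE.sub(lambda m: tag_for("EMAIL", m.group(0)), text)
    text = TITLED_NAME_RE.sub(
        lambda m: m.group(1) + tag_for("NAME", m.group(2)), text
    )
    return text, mapping


def restore(text, mapping):
    """Put the original values back into text that contains placeholders."""
    for tag, value in mapping.items():
        text = text.replace(tag, value)
    return text


sample = "Mr. John Brown alleges homeless people started a fire in his vacant home."
safe, mapping = redact(sample)
print(safe)  # Mr. [NAME_1] alleges homeless people started a fire in his vacant home.
```

Only `safe` would go into the prompt. If the completion comes back containing `[NAME_1]`, calling `restore(completion, mapping)` locally puts the real name back.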
However, if you are willing to send the data within the prompt and only want to keep it out of the completion, you can train the model by providing examples and/or use the instruct Engines and add that requirement as instructions.
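To make the instructions-plus-examples idea a bit more concrete, a prompt could be assembled roughly like this. The instruction wording, the example pair, and the `build_prompt` helper are all hypothetical, and nothing is sent to the API here. Keep in mind that with this approach the client data still leaves your machine inside the prompt; it only aims to keep names out of the completion.

```python
# Hypothetical instruction and example; tune both for your own use case.
INSTRUCTION = (
    "Summarize the report. Do not repeat any personal names; "
    "refer to people by role (e.g. 'the claimant')."
)

EXAMPLES = [
    (
        "Ms. Jane Doe reports that her car was stolen on 3 May.",
        "The claimant reports a car theft on 3 May.",
    ),
]


def build_prompt(report):
    """Build a few-shot prompt: instruction, worked examples, then the new report."""
    parts = [INSTRUCTION, ""]
    for source, summary in EXAMPLES:
        parts += [f"Report: {source}", f"Summary: {summary}", ""]
    # End on an open "Summary:" so the model continues from there.
    parts += [f"Report: {report}", "Summary:"]
    return "\n".join(parts)


print(build_prompt("Mr. John Brown alleges a fire was started in his vacant home."))
```

The resulting string is what you would pass as the prompt to a completion request.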
In any case, if you find an effective way to filter the data before sending it, please share.
Thanks @NSY, I’ll definitely share if I find something that’s open source, but I’ve got a feeling it’s something we’ll simply have to dev ourselves for our specific use case.
Thanks. Privacy is a concern in other use cases as well, so it’s worthwhile to have a public discussion about its different aspects and about what happens to the data after it is sent to the Engine.