Has anyone evaluated how often (if at all) GPT-3 generates long text snippets verbatim from its training data?
We’re worried about copyright claims if GPT-3 does this without our knowledge. We ran some plagiarism tools over 5–10 generations and nothing was flagged, but that’s obviously not an exhaustive analysis, and we were using a fixed set of hyperparameters.
How often does this happen?
Is it more likely to happen if we set temperature = 0?
Is it more likely to happen if our prompt is from the training set?
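Short of an exhaustive analysis, one rough self-check is to scan each generation for long word n-grams that also appear verbatim in a reference corpus (the actual training data isn’t public, so whatever corpus you can assemble is only a proxy). A minimal sketch; the choice of `n = 8` words as the “suspiciously long” threshold is arbitrary, not an established standard:

```python
def word_ngrams(text, n):
    """All word-level n-grams of a text, lowercased, as a set."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def verbatim_overlap(generation, corpus_texts, n=8):
    """Return the n-grams of `generation` found verbatim in any corpus text.

    An empty result means no run of `n` consecutive words was copied;
    a non-empty result is worth a manual look.
    """
    gen_grams = word_ngrams(generation, n)
    hits = set()
    for doc in corpus_texts:
        hits |= gen_grams & word_ngrams(doc, n)
    return hits

generation = "the quick brown fox jumps over the lazy dog every single day"
corpus = ["a classic pangram: the quick brown fox jumps over the lazy dog"]
print(verbatim_overlap(generation, corpus, n=8))
```

This is exact word matching only, so it misses near-verbatim copies with small edits; a fuzzy matcher (e.g. `difflib.SequenceMatcher` from the standard library) would catch more at the cost of speed.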
I have often wondered about this. In universities we use an international system called Turnitin, and it is extremely effective. I wonder whether OpenAI would set up an agreement with Turnitin to help users check for exactly this. Turnitin Wiki entry
As a lawyer, I recommend you check the terms of use for the material you are using. (The terms are usually in the fine print, yuck.) There may be an outright prohibition on using the material for commercial purposes, even without plagiarism — copyright and plagiarism are two separate things. As an example, imagine your own content (say, articles you’ve written and posted on your blog) being downloaded by someone else and used as training data for their own commercial product, without your permission.
Turnitin actually set an interesting court precedent about using an entire copyrighted work without permission as part of a for-profit machine learning model when they won their 2009 lawsuit.
Wow, a plagiarism-detection company got away with using a copyrighted work without permission? Why did I not hear about this… that headline just writes itself!
Was that for NLP, not NLG? I remember Google going all the way to the Supreme Court in the US and winning permission to slurp up all books in existence for their AI. Interesting time to be alive!