I’m curious whether anyone has evaluated how often (if at all) GPT-3 generates long text snippets verbatim from its training data.
We’re worried about copyright claims if GPT-3 does this without our knowing. We tried some plagiarism tools, and when we evaluated 5-10 generations nothing ever got flagged. But obviously that isn’t an exhaustive analysis, and we were using a fixed set of hyperparameters.
- How often does this happen?
- Is it more likely to happen if we set temperature = 0?
- Is it more likely to happen if our prompt is from the training set?
Has anyone thought about this at all?
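For what it's worth, a crude way to go beyond off-the-shelf plagiarism tools is to check word n-gram overlap against a reference corpus yourself. Here's a minimal sketch; the corpus, the n-gram length of 8, and the helper names are all illustrative assumptions, not anything GPT-3-specific:

```python
def ngrams(text, n):
    """Return the set of word n-grams in `text`."""
    words = text.split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

def verbatim_overlap(generation, corpus_docs, n=8):
    """Return corpus n-grams that appear verbatim in `generation`.

    An 8-word exact match is long enough that coincidental overlap is
    rare, so any hit is a candidate memorized span worth inspecting.
    """
    gen_grams = ngrams(generation, n)
    hits = set()
    for doc in corpus_docs:
        hits |= gen_grams & ngrams(doc, n)
    return hits

# Toy example: the generation repeats a long span from the corpus.
corpus = ["the quick brown fox jumps over the lazy dog every single day"]
gen = "she said the quick brown fox jumps over the lazy dog every morning"
print(len(verbatim_overlap(gen, corpus)))  # → 3 overlapping 8-grams
```

The catch, of course, is that this only works if you have (an approximation of) the training corpus on hand, which for GPT-3 we don't, so in practice you'd run it against whatever public text you can assemble.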
I have often wondered about this. In universities we use an international system called Turnitin. It is extremely effective. I wonder if OpenAI would set up an agreement with Turnitin to help users check for exactly this thing.
Turnitin Wiki entry
Fascinating stuff! Thanks for this!
Turnitin actually set an interesting court precedent about using entire copyrighted works without permission as part of a for-profit machine learning system when they won their 2009 lawsuit (A.V. v. iParadigms).
Wow, a plagiarism software company got away with using copyrighted work without permission? Why did I not hear about this… that headline just writes itself!
Was that for NLP rather than NLG? I remember Google going all the way to the Supreme Court in the US and winning permission to slurp up all the books in existence for their AI. Interesting time to be alive!