Has anyone evaluated how often (if at all) GPT-3 generates long text snippets verbatim from its training data?
We’re worried about copyright claims if GPT-3 does this without our knowledge. We ran some plagiarism tools over 5–10 generations and nothing was flagged, but that’s obviously not an exhaustive analysis, and we were using a fixed set of hyperparameters.
How often does this happen?
Is it more likely to happen if we set temperature = 0?
Is it more likely to happen if our prompt is from the training set?
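Short of an exhaustive analysis, one rough self-check is to scan each generation for long word n-grams that also appear verbatim in a reference corpus (the actual training data isn’t public, so whatever corpus you can assemble is only a proxy). A minimal sketch; the choice of `n = 8` words as the “suspiciously long” threshold is arbitrary, not an established standard:

```python
def word_ngrams(text, n):
    """All word-level n-grams of a text, lowercased, as a set."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def verbatim_overlap(generation, corpus_texts, n=8):
    """Return the n-grams of `generation` found verbatim in any corpus text.

    An empty result means no run of `n` consecutive words was copied;
    a non-empty result is worth a manual look.
    """
    gen_grams = word_ngrams(generation, n)
    hits = set()
    for doc in corpus_texts:
        hits |= gen_grams & word_ngrams(doc, n)
    return hits

generation = "the quick brown fox jumps over the lazy dog every single day"
corpus = ["a classic pangram: the quick brown fox jumps over the lazy dog"]
print(verbatim_overlap(generation, corpus, n=8))
```

This is exact word matching only, so it misses near-verbatim copies with small edits; a fuzzy matcher (e.g. `difflib.SequenceMatcher` from the standard library) would catch more at the cost of speed.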
I have often wondered about this. In universities we use an international system called Turnitin, and it is extremely effective. I wonder whether OpenAI would set up an agreement with Turnitin to help users check for exactly this. Turnitin Wiki entry
As a lawyer, I recommend you check the terms of use for the material you are using. (The terms are usually in the fine print, yuck.) There may be an outright prohibition on using the material for commercial purposes, even without plagiarism — copyright and plagiarism are two separate things. As an example, imagine your own content (say, articles you’ve written and posted on your blog) being downloaded by someone else and used as training data for their own commercial product, without your permission.
Turnitin actually set an interesting court precedent about using an entire copyrighted work without permission as part of a for-profit machine learning model when they won their 2009 lawsuit.
Wow, a plagiarism-detection company got away with using a copyrighted work without permission? Why did I not hear about this… that headline just writes itself!
Was that for NLP, not NLG? I remember Google going all the way to the Supreme Court in the US and winning permission to slurp up all books in existence for their AI. Interesting time to be alive!