This seems to be the core of their argument as they have shown that they can re-produce verbatim their own material.
Iād like to think that the transformative argument was introduced to prevent everyone from suing everyone (The 'murican way). At the end of the day it should be about fair use, and fair competition.
If lazy Joe,thinks āI will just copy and paste all NYT newspapers, and sell them for half-price Iāll be rich!ā, NYT canāt compete. They need to gather the data, pay the professionals, sustain the accountants, lawyers. Meanwhile Joe is a self-employed genius. If it wasnāt for these laws lazy Joe could effectively kill journalism.
If New York Reader decides to be a competitor and raise their own enterprise with all the staff (see: paying wages and sustaining the economy) itās fair to say that these two will have news articles that bump heads, and probably rip from eachother, but are transformed enough to justify fair use. For THIS I can see why transformative use is reasonable.
Now thereās GPT. GPT is like the honey badger. GPT donāt care. GPT sees 1,000,000 articles and eats it for breakfast and poops out 1,000,000 articles per hour for pennies (per article). GPT is so powerful it threatens all these enterprises which are sustaining thousands of jobs, running & supporting local events. GPT is lazy Joe, multiplied 1,000,000 times.
Itās not even close to a perfect comparison. The point is that one competitors supports the economy both locally and nationally, while the other just supports itself. In both cases other entities can use their content for wordspinning.
Will there be a new type of journalism? Probably. Is it fair for models to just take whatever training data they can get their hands on, use it and go ālol transformative broā, not at all. Especially when they turn their own policies to cut you off their services for doing the same thing.
At the very least these companies should be able to say āWell, we didnāt give you permission and demand you remove the training dataā⦠And, how the heck do you do that?
.
RAG baby! Paid-for RAG services to retrieve articles and not used for training data. Along with some sort of law for training data audits⦠somehow.
How the hell is the internet going to be free if everybody is stealing everything for their language models? Itās all going to be locked down and strapped with policies.