This seems to be the core of their argument as they have shown that they can re-produce verbatim their own material.
I’d like to think that the transformative argument was introduced to prevent everyone from suing everyone (The 'murican way). At the end of the day it should be about fair use, and fair competition.
If lazy Joe,thinks “I will just copy and paste all NYT newspapers, and sell them for half-price I’ll be rich!”, NYT can’t compete. They need to gather the data, pay the professionals, sustain the accountants, lawyers. Meanwhile Joe is a self-employed genius. If it wasn’t for these laws lazy Joe could effectively kill journalism.
If New York Reader decides to be a competitor and raise their own enterprise with all the staff (see: paying wages and sustaining the economy) it’s fair to say that these two will have news articles that bump heads, and probably rip from eachother, but are transformed enough to justify fair use. For THIS I can see why transformative use is reasonable.
Now there’s GPT. GPT is like the honey badger. GPT don’t care. GPT sees 1,000,000 articles and eats it for breakfast and poops out 1,000,000 articles per hour for pennies (per article). GPT is so powerful it threatens all these enterprises which are sustaining thousands of jobs, running & supporting local events. GPT is lazy Joe, multiplied 1,000,000 times.
It’s not even close to a perfect comparison. The point is that one competitors supports the economy both locally and nationally, while the other just supports itself. In both cases other entities can use their content for wordspinning.
Will there be a new type of journalism? Probably. Is it fair for models to just take whatever training data they can get their hands on, use it and go “lol transformative bro”, not at all. Especially when they turn their own policies to cut you off their services for doing the same thing.
At the very least these companies should be able to say “Well, we didn’t give you permission and demand you remove the training data”… And, how the heck do you do that?
.
RAG baby! Paid-for RAG services to retrieve articles and not used for training data. Along with some sort of law for training data audits… somehow.
How the hell is the internet going to be free if everybody is stealing everything for their language models? It’s all going to be locked down and strapped with policies.