Training an LLM like GPT-4 requires massive amounts of data to produce the fascinating results we have come to enjoy over the past few years.
Unlike Axel Springer, the New York Times is going to court to settle the question of how its content can or should be used to train GPT models.
The article by the New York Times (soft paywall):
The lawsuit:
This surely is an exciting discussion.
What are your thoughts?
It feels like a last-ditch effort by “old media”.
I hope they are not successful, because if they are, and more lawsuits follow, then training models is going to get really, really hard, which isn’t good for AI. At least not for American AI; Japan has already relaxed these restrictions.
I would rather they just expunge all NYT training data and call it good. It’s probably a tiny fraction of the training data anyway; nobody will notice.
PS: I am also not sure they know what they are talking about with Bing. They mention the articles are repeated verbatim, but that isn’t the LLM’s training; as I recall, Bing was live-scraping the content during search and then regurgitating it through the LLM. So basically RAG, not trained weights.
To fix this, have all search engines de-index them, essentially silencing their content to the entire internet minus the subscribers behind the paywall. Not sure the NYT wants this. But anything searchable on the net is now available to LLMs … and this has nothing to do with training or LLM weights.
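To make the distinction concrete, here is a minimal RAG sketch, assuming a placeholder search step and the OpenAI Python SDK; the `search_web` helper and the model name are illustrative, not how Bing actually wires it up:

```python
# Minimal RAG sketch: retrieve live web content, then let the LLM answer from it.
# search_web() is a stub and the model name is an assumption for illustration.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def search_web(query: str) -> str:
    """Placeholder for a live search/scrape step (e.g. search results -> page text)."""
    raise NotImplementedError("plug in a search engine of your choice")


def rag_answer(question: str) -> str:
    context = search_web(question)  # fresh content, fetched at query time
    response = client.chat.completions.create(
        model="gpt-4",  # any chat model works; nothing here touches its weights
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content
```

The point: the article text enters through the prompt at query time, not through the weights.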
From a strategic perspective it’s likely worth the shot for the Times. They probably got a similar offer to what Axel Springer got but figured they could, or should, get more.
If they succeed, it’s a win. If they fail in court, they can maybe still take up the offer.
I still think content licensing will hurt AI development, and open-source models most of all. So sure, let the NYT “settle” and get content licensing dollars flowing in … I get it.
To offset this, AI models should do inline ads, right? I’m not against it, but that’s where it’s headed.
Remember:
Every action has an equal and opposite reaction.
The money paid to license the training content is passed on to you. It’s not absorbed by the developers at all.
OpenAI really took advantage of the current state of the internet when they started training their models and kicked up dust for everybody else. Now everybody is locking up their data and OpenAI will cut anybody off who uses theirs for training.
The Axel Springer deal was gross. There’s no better word for it. Just gross. I wouldn’t be surprised if other publishers also got weak offers that involve top-priority retrieval of their articles.
I hope that everybody wins and journalism stays relevant, somehow, and is not completely masked by OpenAI.
I wonder how much effort and money Google has silently invested in this question. Surely their top priority with LLMs has been how to mix them with their current ads platform without completely rocking the boat.
And, hell no. I’ll gladly use another model that doesn’t actively track and monitor me for advertising. No thanks. Nope. Smaller models are catching up and are looking very appealing for niche tasks. I don’t need a megabrain genius to write articles, or interact with my website.
This is where RAG will be the real king. Just train the LLM to be “smart enough” while dodging the litigious content, like the NYT, and then enable a simple search to pump the content through the LLM anyway … BOOM, no AI training needed, and now the model really will parrot everything it sees verbatim, with a little bit of its own flair thrown in.
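The “dodge the litigious content” step at training time could be as simple as a domain blocklist over the corpus. A minimal sketch, assuming a toy record format (dicts with a `url` key) and a made-up blocklist:

```python
# Sketch: curate a training corpus by dropping documents from blocklisted domains.
# The record format and the blocklist are illustrative assumptions.
from urllib.parse import urlparse

BLOCKLIST = {"nytimes.com"}  # litigious sources to exclude from training


def keep(record: dict) -> bool:
    domain = urlparse(record["url"]).netloc.lower()
    # Drop the exact domain and any subdomain (e.g. cooking.nytimes.com).
    return not any(domain == d or domain.endswith("." + d) for d in BLOCKLIST)


corpus = [
    {"url": "https://www.nytimes.com/2023/12/27/some-article.html", "text": "..."},
    {"url": "https://example.com/blog/post", "text": "..."},
]
filtered = [r for r in corpus if keep(r)]  # only the example.com record survives
```

The excluded content can then re-enter at query time through search, exactly as described above.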
It’s a lose-lose; the cat’s already out of the bag.
A few ideas spring to mind.
In the future, LLMs will likely be able to leverage synthetic data far more efficiently than current model generations can. That should at least pave a way forward.
I haven’t fully considered the impacts on the open source community but I am sure they will find a way!
But what if, in the future, one could purchase the RAG knowledge and content, as @curt.kennedy puts it, separately? It would require better management of the applied knowledge, but it would allow content creators to get paid specifically for their work.
OpenAI is so focused on centralization, while the rest of the world (especially the open-source community) moves towards federation.
Realistically, if federated networks become more common, we can gather news and interesting facts from local people, not corporations. That is how Twitter wants to see itself, but with the same old-man ideology of controlling everything and being “the” source.
A thought I’ve always had is Fediverse communities offering paid access to their content. Photography, sketches, animations, programs. A community of passionate people that can make some beer money on the side.
We can build powerful programs that gather relevant content from accounts with similar interests, harmonize it, and create some interesting articles using simpler models.
So many potential business models can be spun up around these ideas.
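As a toy version of that gathering step, here is a sketch that pulls recent public posts for a hashtag from a Mastodon instance; the instance, hashtag, and the downstream summarization are assumptions:

```python
# Sketch: gather recent public posts on a topic from a Mastodon instance.
# Instance and hashtag are arbitrary; the "harmonize" step is left as a stub.
import requests
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Strip HTML tags from a status body, keeping only the text."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(html: str) -> str:
    parser = TextExtractor()
    parser.feed(html)
    return "".join(parser.parts)


def gather(instance: str, hashtag: str, limit: int = 20) -> list[str]:
    # Public hashtag timeline; no auth needed on most instances.
    url = f"https://{instance}/api/v1/timelines/tag/{hashtag}"
    statuses = requests.get(url, params={"limit": limit}, timeout=10).json()
    return [strip_html(s["content"]) for s in statuses]


posts = gather("mastodon.social", "photography")
# ...feed `posts` to a smaller model to harmonize them into an article.
```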
Imagine licensing or buying a law text for students or practitioners, chunked according to the specific requirements of the user, with the embedding vector size/embedding model as a plus.
And an army of trained white-collar workers can dig through the knowledge and create such datasets, ready for use in AI models as needed.
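A rough sketch of what such a purchasable, RAG-ready dataset could look like; the chunk size, overlap, and embedding model are illustrative assumptions:

```python
# Sketch: turn a licensed text into a RAG-ready dataset of chunks + embeddings.
# Chunk size, overlap, and the embedding model are illustrative choices.
from openai import OpenAI

client = OpenAI()


def chunk(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    """Fixed-size character chunks with overlap; buyers could request other schemes."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]


def build_dataset(text: str) -> list[dict]:
    chunks = chunk(text)
    resp = client.embeddings.create(model="text-embedding-3-small", input=chunks)
    # Ship the embedding model name and dimensions alongside the data,
    # so buyers can match their retrieval setup to the product.
    return [{"chunk": c, "embedding": e.embedding} for c, e in zip(chunks, resp.data)]


# dataset = build_dataset(open("law_text.txt").read())
```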
But I feel like we are moving off-topic. Let’s hope this period of legal uncertainty moves by fast and that the technological requirements for improved knowledge sharing mechanisms arrive just as fast.
Agree. The lawsuit text was saying that articles were being quoted verbatim, so I was thinking it was related to this:
So the model had access to the content behind the paywall and regurgitated it via RAG, essentially, giving the appearance it was trained on the data, which wasn’t the case.
Because the model weights are frozen to some point in the past, the only way to get relevant articles is through some sort of RAG, which has nothing to do with the LLM.
I can already see how the lawyers are going to spin this.
From page 2 of the lawsuit document:
Defendants’ GenAI tools can generate output that recites Times content verbatim, closely summarizes it, and mimics its expressive style, as demonstrated by scores of examples. See Exhibit J.
So if it’s RAG, it’s not in the LLM. If it’s in the LLM, it can be taken out of training, and added back in via RAG.
The only refuge is silence, and that would be disastrous for any company, especially broad outreach media companies.
It appears that some webpages have issues paywalling their content when accessed in “reader” mode. Because the page can’t load the paywall overlay, the content is available for free. It should be up to the lawyers and judges to determine who’s at fault in this case.
I think they have a valid point, as ChatGPT is allowing people to bypass their paywalls just by asking it. But I also believe the right to free speech should be applied to AI, so I have no idea if the New York Times will win or not. Also, I am assuming OpenAI didn’t directly feed ChatGPT the information but instead gave it access to the internet for a short period of time. (I could be wrong, correct me if I am.)
Some people are, and it will become more common with web browsing.
It’s a joke that a bot can rip content, and spin it as output.
It’s also a joke trying to read popular publisher pages when > 1/3 of the page is covered with ads.
I think there are some legal grounds here. Web browsing is gross. The content being taken is mainly supported by advertising, and web browsing takes the work of others to display as its own.
Fair. I think I’m venting more about Web Browsing than discussing the lawsuit.
Going back. It seems like:
Defendants insist that their conduct is protected as “fair use” because their unlicensed use of copyrighted content to train GenAI models serves a new “transformative” purpose
I think this lawsuit is a step in the right direction, and besides all the usual sword-rattling there are a lot of interesting legal questions that need to be answered before there can be a fair negotiation between the parties involved.
In one corner we have OpenAI, who claims that content used during training, and the responses from their models, including snippets from news sources, are fair use. In the other corner we have the New York Times, who claims that OpenAI has violated their copyright.
The court system doesn’t like to make broad decisions based on vague suggestions and prefers solid evidence, so this will come down to what examples of infringing content NYT can produce, and under what circumstances.