To train an LLM like GPT-4, massive amounts of training data are needed to produce the fascinating results we have come to enjoy over the past few years.
Unlike Axel Springer, The New York Times Company is going to court to settle the question of how its content can or should be used to train GPT models.
The article by The New York Times (soft paywall):
This surely is an exciting discussion.
What are your thoughts?
I hope they are not successful, because if they are, and more lawsuits follow, then training models is going to get really really hard, which isn’t good for AI. At least American AI. Japan has already relaxed these restrictions.
I would rather they just expunge all training data from NYT and call it good. It’s probably a tiny fraction of training data anyway, nobody will notice.
PS: Also, I am not sure they know what they are talking about with Bing either. They mention the articles are repeated verbatim. That isn’t the LLM’s training at work: as I recall, Bing was live-scraping the content during search and then regurgitating it through the LLM. So basically RAG, not trained weights.
To fix this, have all search engines de-index them, essentially silencing their content to the entire internet, minus their subscribers behind the paywall. Not sure the NYT wants that. But anything searchable on the net is now available to LLMs, and this has nothing to do with training an LLM’s weights.
From a strategic perspective it’s likely worth the shot for the Times. They probably got an offer similar to what Axel Springer got but figured they could, or should, get more.
If they succeed, it’s a win. If they fail in court, they can maybe still take up the offer.
OpenAI really took advantage of the current state of the internet when they started training their models and kicked up dust for everybody else. Now everybody is locking up their data and OpenAI will cut anybody off who uses theirs for training.
The Axel Springer deal was gross. There’s no better word for it. Just gross. I wouldn’t be surprised if other publishers also got weak offers that involve top-priority retrieval of their articles.
I hope that everybody wins and journalism stays relevant, somehow, and not completely masked by OpenAI.
I wonder how much effort and money Google has quietly invested in this question. Surely it has been their top priority with LLMs: how to mix them into their current ads platform without completely rocking the boat.
And, hell no. I’ll gladly use another model that doesn’t actively track and monitor me for advertising. No thanks. Nope. Smaller models are catching up and are looking very appealing for niche tasks. I don’t need a megabrain genius to write articles, or interact with my website.
This is where RAG will be the real king. Just train the LLM to be “smart enough” while dodging the litigious content, like NYT’s, and then enable a simple search to pump the content through the LLM anyway. Boom: no training on that data needed, and now the model really will parrot everything it sees verbatim, with a little bit of its own flair thrown in.
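To make the retrieve-then-generate pattern concrete, here is a minimal sketch. The retriever is a toy bag-of-words scorer standing in for a real search engine, and the prompt-building step shows why the model never needs to be *trained* on the content: it arrives at inference time. All names here (`score`, `retrieve`, `build_prompt`, the sample corpus) are illustrative, not any vendor’s API.

```python
# Toy sketch of RAG: retrieve content at query time, then stuff it into
# the prompt. Nothing here touches the model's trained weights.

def score(query: str, doc: str) -> int:
    """Count how many query words appear in the document (toy relevance)."""
    doc_words = set(doc.lower().split())
    return sum(1 for w in query.lower().split() if w in doc_words)

def retrieve(query: str, corpus: list[str], k: int = 1) -> list[str]:
    """Return the k highest-scoring documents for the query."""
    return sorted(corpus, key=lambda d: score(query, d), reverse=True)[:k]

def build_prompt(query: str, passages: list[str]) -> str:
    """Inject retrieved passages into the prompt at inference time."""
    context = "\n---\n".join(passages)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

corpus = [
    "The city council approved the new transit budget on Tuesday.",
    "A recipe for sourdough bread with a long cold fermentation.",
]
query = "What did the city council approve?"
prompt = build_prompt(query, retrieve(query, corpus))
print(prompt)
```

In a real deployment the toy scorer would be a search engine or vector database, and the prompt would go to a chat model, which is exactly how the model ends up repeating retrieved text “verbatim with a little flair” without that text ever being in its weights.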
A few ideas spring to mind.
In the future LLMs can likely leverage synthetic data a lot more efficiently than the current model generations can. This should at least pave a way forward.
I haven’t fully considered the impacts on the open source community but I am sure they will find a way!
But what if, in the future, one could purchase the RAG knowledge and content, as @curt.kennedy puts it, separately? It would require better management of the applied knowledge, but it would allow content creators to get paid specifically for their work.
OpenAI is so focused on centralization, while the rest of the world (especially the open-source community) moves towards federation.
Realistically, if federated networks become more common, we can gather news and interesting facts from local people, not corporations. Like how Twitter wants to see itself. Same old-man ideology of controlling everything and being “the” source.
A thought I’ve always had is Fediverse communities offering paid-access for their content. Photography, sketches, animations, programs. A community of passionate people that can make some beer money on the side.
We can build powerful programs that gather relevant content from accounts with similar interests, harmonize it, and create some interesting articles using simpler models.
So many potential business models can be spun up around these ideas.
Imagine licensing or buying a law text for students or practitioners, chunked according to the specific requirements of the user, with the embedding vector size and embedding model as a plus.
And an army of trained white-collar workers could dig through the knowledge and create such datasets, ready for use in AI models as needed.
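The “chunked to the user’s requirements” idea can be sketched in a few lines. This toy splitter cuts a licensed text into overlapping chunks with their positions recorded, so each chunk could be individually embedded, licensed, and attributed. The chunk sizes and the `chunk_text` name are illustrative assumptions, not a standard.

```python
# Sketch: split a licensed text into overlapping, position-tagged chunks
# sized for whatever embedding model the buyer specifies.

def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[dict]:
    """Split text into overlapping character chunks, recording offsets so
    every chunk stays attributable to its place in the source work."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        if piece:
            chunks.append({"start": start, "end": start + len(piece), "text": piece})
        if start + chunk_size >= len(text):
            break
    return chunks

law_text = "Section 1. ..." * 100  # stand-in for a licensed law text
chunks = chunk_text(law_text, chunk_size=200, overlap=50)
print(len(chunks), chunks[0]["start"], chunks[1]["start"])
```

The overlap keeps sentences that straddle a chunk boundary retrievable from both sides; a real product would also attach per-chunk license metadata so the creator gets paid when a chunk is used.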
But I feel like we are moving off-topic. Let’s hope this period of legal uncertainty passes quickly, and that the technological requirements for improved knowledge-sharing mechanisms arrive just as fast.
It appears that some webpages have issues paywalling their content when accessed in “reader” mode. Since the page can’t load the overlay, the content is available for free. It should be up to the lawyers and the judges to determine who’s at fault in that case.
I think they have a valid point, since ChatGPT lets people bypass their paywall just by asking it. But I also believe the right to free speech should apply to AI, so I have no idea whether the New York Times will win or not. Also, I am assuming OpenAI didn’t directly feed ChatGPT the information but instead gave it access to the internet for a short period of time. (I could be wrong, correct me if I am.)
I think this lawsuit is a step in the right direction, and besides all the usual sword rattling there are a lot of interesting legal questions that need to be answered before there can be a fair negotiation between the parties involved.
In one corner we have OpenAI, who claims that content used during training, and the responses from their models, including snippets from news sources, are fair use. In the other corner we have the New York Times, who claims that OpenAI has violated their copyright.
The court system doesn’t like to make broad decisions based on vague suggestions and prefers solid evidence, so this will come down to what examples of infringing content NYT can produce, and under what circumstances.