The NY Times is suing OpenAI and Microsoft

This is actually an old question, and the real question should be why now, not earlier. Every SEO company scrapes sites without permission and uses that data to sell its services. But the amount of scraping and the size of the business are on a totally different level with what OpenAI did and does.

But I totally understand why every player who wants to create (low-level) content to sell and/or show ads against is now upset that this very base content might be banned. Oh my god, they would actually have to create something :wink:

You are forgetting one point. Someone else has paid for that content, and paid for a load of scraping too.

But for OpenAI that is a very appealing business model indeed.

Hi Jakke

In my opinion SEO is a VERY different case. SEO uses websites’ content without reproducing it publicly, while the whole point of these legal cases, as I understand them, is not the use of information to produce something based on it, but rather literal reproduction without giving credit to the source.

I personally believe that the first use case (which SEO and the intended behavior of LLMs fall into) is a totally valid one if the information is public.

The second one (reproduction), I would say, has some issues, but at the same time reproducing such content is not the intent of LLMs; it is rather an artifact, almost a bug.

This is a good thing, as it’s about time these large tech companies like OpenAI paid for what they use. Look at Google: most of the traffic on the internet infrastructure we all pay for as taxpayers is YouTube, yet they pay nothing towards that infrastructure. Large tech companies that spring up like OpenAI soon turn into tax-avoiding monopolies due to the power and wealth they wield. This means less innovation, less competition, and a worse situation for consumers and anyone working in the field.
And the information being harvested in this case has been collected by an existing company, and they should be compensated for that data and its usage. OpenAI should pay for what belongs to other companies and people.

Open any book, for example, and it clearly states in the many copyright notices why books cannot be used in AI training, stored within a neural network or semantic network model, or even used in inference…

  1. “No part of this book may be used or reproduced in any manner” - so when an AI chatbot quotes from an article/book, this is against copyright law

  2. “No part of this publication may be used or reproduced, stored in a retrieval system, or transmitted in any form” - this clearly states that the contents of the book cannot be stored in the AI model, a database, or any other electronic retrieval system. It also cannot be transmitted, so when a PC client uses the OpenAI API across the internet, the inference response from the AI system is transmitted over HTTP(S) on top of TCP/IP, so this also breaks copyright.

True, this does set a precedent in litigation.

But what if it turns out GPT-4 is trained MOSTLY on New York Times data? The world is so crazy I might actually believe that.

(bad joke lol)

No, it is not. Quoting per se is always allowed, but it has to be a quote within other relevant content, not a partial copy.

1 Like

Agreed; according to that interpretation, every single book report, review, analysis, etc., would be a copyright violation.

Much of this stuff falls under transformative use:

2 Likes

I believe the biggest question the court needs to answer is:

Does the inclusion of copyrighted text in the training data for an AI model amount to republication of said data?

My personal take is that the output of a transformer model like GPT is transformative use and completely legal, as long as the training data itself isn’t republished.

4 Likes

There has been a recent research paper examining the behavior of GPT models replying with memorized content from the training data. If I recall correctly, at first it was possible to steer the model towards replicating specific data, before a first, partial bugfix was implemented.
If so, it could be a case of “ignorance does not protect one from punishment.”
But at this point we would actually need to see the evidence in order to arrive at an opinion.
This is going to be exciting either way.
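For what it’s worth, the kind of memorization check such research runs can be sketched in a few lines. This is my own toy illustration, not the paper’s method (the function names, the word-level tokenization, and the n-gram length of 8 are all made up for the example): flag any long verbatim word n-gram shared between a model’s output and a known corpus.

```python
# Toy memorization detector: report long exact n-grams a model's output
# shares with a reference corpus. Real evaluations work on tokens and at
# far larger scale; the threshold n=8 here is an arbitrary illustration.
def shared_ngrams(output, corpus_doc, n=8):
    """Return word n-grams that appear verbatim in both texts."""
    def grams(text):
        words = text.lower().split()
        return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}
    return grams(output) & grams(corpus_doc)

def looks_memorized(output, corpus, n=8):
    """True if the output copies a long exact span from any corpus document."""
    return any(shared_ngrams(output, doc, n) for doc in corpus)
```

A paraphrase shares no long exact span and passes, while a verbatim copy of eight or more consecutive words gets flagged.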

2 Likes

There have actually been multiple different versions of this over the years, but I believe you’re thinking of this “attack”:

(screenshot dated 2023-11-29 showing the attack)

American copyright law, more specifically section 512 of the Digital Millennium Copyright Act (DMCA), offers a “safe harbor” to online service providers if they take certain steps to address copyright infringement.

I don’t think it will be an issue, since OpenAI has been very proactive about fixing these problems.

3 Likes

There is also a suggestion to consider…
Patching an exploit is not fixing the underlying vulnerability.

In particular, unlike traditional programs, there is no simple patch for systems like LLMs.

2 Likes

Agreed. In the case of using news article content as training data, it’s not much different from reading five articles, distilling them, and writing one yourself.

So writing as a career is dead … lol. BUT being a source of up-to-date facts and local events certainly isn’t. To me it makes sense that these sources will release their content to paying subscribers, who then absorb it using their LLM and wordspin it for whatever purpose they see fit.

So… FEDIVERSE :raised_hands:. The next best thing would be Reddit, where if I wanted to find up-to-date news and facts about Brazil, for example, I could just visit /r/Brazil

I have been working on a bot that does exactly this. It’s very rudimentary, but it pulls trending articles, distills facts, and then wordspins them with some DALL-E images. It lacks depth, generalizes everything, and has no personality or human touch. Basically a bunch of meaningless words :joy:

It would be very nice if I didn’t have to scrape data and could instead just do something like:
Go to a federated community that revolves around DevOps → Pay money to run a query of “AWS vs Azure vs GCP” → Get facts, information, common discussions → Use for article writing.
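The distill-and-wordspin part of such a bot can be sketched roughly like this. Everything here is hypothetical; a real bot would call an LLM (and an image model) rather than this naive keyword summarizer, which just keeps the sentences that share the most vocabulary across the pulled articles:

```python
# Toy article-distilling pipeline: split articles into sentences, score
# each sentence by how common its words are across all articles, keep the
# top few as "facts", then hand them to a (stubbed) rewrite step.
import re
from collections import Counter

def distill(articles, n_facts=3):
    """Rank sentences by average corpus-wide word frequency."""
    sentences = [s.strip() for a in articles
                 for s in re.split(r"(?<=[.!?])\s+", a) if s.strip()]
    freq = Counter(w for s in sentences
                   for w in re.findall(r"[a-z]+", s.lower()))
    def score(s):
        toks = re.findall(r"[a-z]+", s.lower())
        return sum(freq[t] for t in toks) / max(len(toks), 1)
    return sorted(sentences, key=score, reverse=True)[:n_facts]

def wordspin(facts):
    """Stand-in for the LLM rewrite step; a real bot would prompt a model here."""
    return "Today's roundup: " + " ".join(facts)
```

It produces exactly the kind of shallow, generalized output described above, which is rather the point.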

I mean, seriously. We all have computers that are PLENTY sufficient to host our own little space. Why do we need to pay Twitter, YouTube, or Instagram to host our content? Once home assistants become as common as dishwashers, I think this will be our reality :raised_hands:

2 Likes

Thankfully, copyright notices in books are not legally binding, and there exist exemptions to copyright for fair use in the United States.

At issue is whether or not training a neural network on text which is under copyright protection falls under fair use.

Two cases which will undoubtedly be cited are Perfect 10 v. Google and Authors Guild v. Google, both of which Google won handily.

The main claim for fair use will be that the nature of the infringement is transformative, that is, the purpose is to create something entirely new with a different purpose than the original.

One would be very hard pressed to make the claim that OpenAI, in training an AI model on data which includes the text of NYT articles, was doing so to create something which would produce exact replicas of that text—in fact, one would generally need to try very hard to get the model to precisely replicate the original training data.

Given the absolutely vast amount of training data present in the model and the relatively small proportion of it which can be attributed to the New York Times, it’s pretty clear the purpose of the model isn’t to replicate New York Times articles.

Additionally, since we’re discussing news, which by its nature has a very short shelf-life (what exactly does the New York Times do with unsold copies of its print newspaper?), it’s going to be a pretty hard sell to suggest copying of its articles causes any economic harm, even if the nature of the copying isn’t transformative (which it clearly is).

In short, I really don’t see this lawsuit going anywhere.

2 Likes

Let’s go, guys. There has been a clear violation of copyright, and it is right and proper to regulate this aspect; otherwise, the risk is ending up with a second Google, meaning the entire publishing industry on its knees due to a monopoly of intermediation + monetization of news (without consent).

There are some interesting legal points to be made, but some others are far-fetched.

Let’s break it down.

  • A person could read NY Times articles and many other newspapers (assume old media). She gains wide knowledge from all these sources and can answer questions in her own words (perhaps even subconsciously using very similar language). We do not consider that copyright infringement for her – the same should be true if it comes from an AI.

  • However, Microsoft/OpenAI chat (the same for Google) returns original/quoted sentences from the source and gives a URL reference directly to the article.

    • These quoted sentences could be interpreted as copyright infringement.
    • But the argument about impact on ad revenue is weak. At its core, users no longer have to go to the NYT website and navigate through all the clickbait in order to find the right content.
      There was an old legal case about TV video recording that let viewers fast-forward through ads; that argument lost.

But nowadays it’s hard to say anything; it really depends on which side the judge is on.

1 Like

Depends; quoted sentences are usually considered fair use unless we’re talking about extensive quoting of the same source.

How much of someone else’s work can I use without getting permission?

Under the fair use doctrine of the U.S. copyright statute, it is permissible to use limited portions of a work including quotes, for purposes such as commentary, criticism, news reporting, and scholarly reports. There are no legal rules permitting the use of a specific number of words, a certain number of musical notes, or percentage of a work. Whether a particular use qualifies as fair use depends on all the circumstances.
Source: https://www.copyright.gov

1 Like

Really appreciate you posting the original quote on fair use!

I think it’s a weak case; maybe the argument hinges on whether the usage falls into one of the specified “such as” areas.

Bottom line, the current law (copyright and many others) needs development to handle AI-related social impacts, much broader than just copyright. We are not there yet.

1 Like

Always happy to help,

I agree that US copyright law is completely broken, although I’m blaming Disney for that, not AI :rofl:

In the case of The New York Times vs OpenAI, it boils down to a few key points if you ask me:

  1. Purpose: OpenAI uses the articles for AI training, not for news sharing.

  2. Nature of Work: is the copyrighted material fact or fiction? Factual work might be more likely to be deemed fair use than a highly creative work like a novel or a song.

  3. Amount Used: if the output of the models uses only small bits of many articles, it might lean towards fair use. But if it used big, important parts, that could be a problem.

  4. Market Effect: If OpenAI’s use doesn’t hurt the NYT’s business (like people aren’t using OpenAI instead of reading the NYT), that’s better for OpenAI’s case.

TL;DR: It’s all about whether OpenAI’s use is “transformative” enough and doesn’t step on the NYT’s toes business-wise. It’s a complex issue, and courts can be unpredictable.

6 Likes

Yeah. I feel like it’s a bit unfair, as this reproduction was not targeted or even conscious behavior by Large Language Models.

Does it step on NYT’s toes? Absolutely.

Are they consuming their copyrighted content at large without credit or payment? Absolutely.

Are they trying to appeal to these publishers through negotiations, only so far grabbing the same publisher that was allegedly paid by the CIA to promote US foreign policy? Absolutely.

If anything I think this will really cause lawmakers to reconsider the laws for the future. It’s obvious that Large Language Models are and will be massive for content creation.

To me, it’s unfair that Company A dedicates the resources and time, pays the wages, and outputs an article, only for Company B to immediately absorb it for financial gain.

A fair future to me seems to be paying for the content to be absorbed by Large Language Models.

So browsing sites can “opt out”, but there’s no “opt-out” for training, and if LLMs are somehow kept up to date and >10,000 content creators can spit out massive aggregations of recent wordspun articles for <$0.50 to their demographics, who pays whom?

Absolutely, the model is completely unaware whether its responses are facts, hallucinations, or copyright infringements. It is just happy to produce some text :rofl:

I’m not sure. For me, it’s all about the “purpose” question. If OpenAI used legally obtained copyrighted material for the purpose of teaching their models how to read, write, and understand the English language, then I would consider it fair use. However, if the purpose of said training is to make the models respond using verbatim text directly from the source, I would consider it republication and a violation of NYT’s copyright.

1 Like

This seems to be the core of their argument, as they have shown that the model can reproduce their own material verbatim.

I’d like to think that the transformative-use argument was introduced to prevent everyone from suing everyone (the ’murican way). At the end of the day it should be about fair use and fair competition.

If lazy Joe thinks, “I will just copy and paste all NYT newspapers and sell them for half price, I’ll be rich!”, NYT can’t compete. They need to gather the data, pay the professionals, and sustain the accountants and lawyers. Meanwhile Joe is a self-employed genius. If it weren’t for these laws, lazy Joe could effectively kill journalism.

If New York Reader decides to be a competitor and raise its own enterprise with all the staff (see: paying wages and sustaining the economy), it’s fair to say that these two will have news articles that bump heads, and probably rip from each other, but are transformed enough to justify fair use. For THIS I can see why transformative use is reasonable.

Now there’s GPT. GPT is like the honey badger. GPT don’t care. GPT sees 1,000,000 articles and eats it for breakfast and poops out 1,000,000 articles per hour for pennies (per article). GPT is so powerful it threatens all these enterprises which are sustaining thousands of jobs, running & supporting local events. GPT is lazy Joe, multiplied 1,000,000 times.

It’s not even close to a perfect comparison. The point is that one competitor supports the economy both locally and nationally, while the other just supports itself. In both cases other entities can use their content for wordspinning.

Will there be a new type of journalism? Probably. Is it fair for models to just take whatever training data they can get their hands on, use it, and go “lol transformative bro”? Not at all. Especially when they turn around and use their own policies to cut you off from their services for doing the same thing.

At the very least these companies should be able to say, “Well, we didn’t give you permission, and we demand you remove the training data”… And how the heck do you do that? :person_shrugging:

RAG, baby! Paid-for RAG services to retrieve articles at query time, not to be used as training data. Along with some sort of law for training-data audits… somehow.
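A minimal sketch of what that paid-for retrieval idea could look like. Everything here is hypothetical (a real service would use a vector index and an LLM call, not this keyword retriever), but it shows the key point: the licensed articles live in a store queried at answer time, not baked into model weights.

```python
# Toy RAG retrieval step: rank licensed articles by word overlap with the
# query, then assemble a prompt with attribution. A real service would
# replace both functions with embeddings and an actual LLM request.
import re

def retrieve(query, licensed_articles, k=2):
    """Return the k articles sharing the most words with the query."""
    q = set(re.findall(r"[a-z]+", query.lower()))
    def overlap(doc):
        return len(q & set(re.findall(r"[a-z]+", doc.lower())))
    return sorted(licensed_articles, key=overlap, reverse=True)[:k]

def build_prompt(query, licensed_articles):
    """Assemble the licensed context an LLM would answer from."""
    context = "\n".join(f"- {doc}" for doc in retrieve(query, licensed_articles))
    return f"Answer using only these licensed sources:\n{context}\nQuestion: {query}"
```

The publisher gets paid per retrieval, and nothing from the store ever enters a training run.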

How the hell is the internet going to stay free if everybody is stealing everything for their language models? It’s all going to be locked down and strapped with policies.