Some questions on copyrighted material

  1. How can there be evidence that copyrighted material was illegally used to train a language model? Being able to quote from a book, or to create stories with the characters and in the style of some book or author, cannot be evidence when the author, book, and characters are so popular that thousands of people talk and write about them on the internet.

  2. Couldn’t language models, in principle, be fine-tuned (e.g. by RLHF) to refuse to generate texts in the style of living authors? (What could it be good for?)

Addendum: Only later did I find the Authors Guild class action complaint. Points 93–104 are of particular interest.

  1. I think this is a great point. Although it wasn’t trained on the exact copyrighted material, it can pick up certain elements from different sources and piece them together. I think a lot of people also purposely look for this material to try and create a ruckus.

  2. Yes and I think they have recently been training the model to prevent it from outputting potentially copyrighted material verbatim:

I can’t share links because I have training turned off but here’s an easy example:

Please write the lyrics to “The Drummer Boy”

I’m sorry, but I can’t provide the lyrics to copyrighted songs. Would you like a summary of “The Drummer Boy”

If this is related to all the people suing, a quick glance makes it obvious that their case is crap:

“When prompted, ChatGPT accurately generated summaries of several of the Martin infringed works, including summaries for Martin’s novels ‘A Game of Thrones,’ ‘A Clash of Kings,’ and ‘A Storm of Swords,’ the first three books in the series A Song of Ice and Fire,” the suit notes, adding that ChatGPT has also created prequels and alternate versions of his books.

“ChatGPT could not have generated the results described above if OpenAI’s LLMs had not ingested and been ‘trained’ on the Martin infringed works,” the complaint alleges.

So because ChatGPT can accurately summarize their material they believe this must mean that it has been trained on all of their works. Puh-lease.

I think the proof is in the pudding.

The suit alleges that ChatGPT has been used by a programmer named Liam Swayne to “write” the sequels to George R.R. Martin’s best-selling series “A Song of Ice and Fire,” which was adapted into the hit HBO show “Game of Thrones.” Martin hasn’t yet published the two final novels in the series – the lawsuit notes that he’s currently writing them — but Swayne used ChatGPT to create his own versions of these novels, which he has posted online.

They are afraid that they are going to lose money because ChatGPT can produce similar content to them.

If you take random excerpts from Game of Thrones, for example, crank the temperature down to 0, and let the instruct model continue them, it doesn’t follow the story, which to me is a decent indicator that it’s not trained on the whole damn book.

It was supposed to be “black moleskin gloves, and a fine supple coat of gleaming black ringmail over layers of black wool and boiled leather. Ser Waymar had been a Sworn Brother of the Night’s Watch for less than half a year, but no one could say he had not prepared for his vocation. At least insofar as his wardrobe was concerned.”
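If you want to score this kind of continuation test rather than eyeball it, here is a minimal sketch in Python. It assumes you already have the model's continuation as a plain string (which model and which API you use to get it is up to you and not shown here). The function names `shared_prefix_words` and `similarity` are made up for illustration, and the `generated` string is invented, not real model output.

```python
from difflib import SequenceMatcher


def shared_prefix_words(expected: str, generated: str) -> int:
    """Count how many leading words match exactly before the texts diverge."""
    count = 0
    for exp_word, gen_word in zip(expected.split(), generated.split()):
        if exp_word != gen_word:
            break
        count += 1
    return count


def similarity(expected: str, generated: str) -> float:
    """Rough 0..1 similarity over the word sequences (1.0 means identical)."""
    return SequenceMatcher(None, expected.split(), generated.split()).ratio()


# Start of the true continuation, taken from the excerpt quoted above.
expected = "black moleskin gloves, and a fine supple coat of gleaming black ringmail"

# HYPOTHETICAL model output, invented purely to show how the scoring works.
generated = "black moleskin gloves, and a heavy cloak lined with fur"

print("matching leading words:", shared_prefix_words(expected, generated))
print("word-level similarity:", round(similarity(expected, generated), 2))
```

A long exact-match prefix across many random excerpts would point toward memorization. A quick divergence like the made-up one here would not, although it cannot prove the text was never seen.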

Exactly this was the background of my question. I think the authors may be wrong, because GPT was possibly not trained on their books directly, which therefore have not been stolen. (But maybe they have evidence? I couldn’t find it.) And I think it’s ok to be afraid of losing money to clever businessmen who sell prompts that generate texts in their style and with their characters. (And even to want to see this forbidden, or at least to be paid for it.)

By compelling the developers of said language model to disclose their training data via court order.

Perhaps, but LLMs are slippery. The larger, more complex, and more powerful they become, the more difficult it is to keep them playing within the boundaries.

It would be good for possibly reducing the number of lawsuits OpenAI will need to respond to.

That said, I think the authors are going to have an uphill battle here, because the genie is out of the bottle.

The cost of a unit of computation decreases by an order of magnitude every 3–4 years. That’s a 1000-fold reduction in training cost over a decade. So, if GPT-4 cost ~$1B to train today, we would expect it to cost ~$1M to train in 2033 and ~$1k to train in 2043.
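As a rough sanity check on that arithmetic, here is a small Python sketch. It assumes a steady three orders of magnitude per decade (one order roughly every 3.3 years), and the ~$1B starting figure is only the hypothetical from this post, not a known training cost. The function name `projected_cost` is made up for illustration.

```python
def projected_cost(cost_now: float, years_ahead: float,
                   orders_of_magnitude_per_decade: float = 3.0) -> float:
    """Cost after `years_ahead` years if the cost keeps falling at a fixed rate."""
    return cost_now / 10 ** (orders_of_magnitude_per_decade * years_ahead / 10)


cost_today = 1e9  # hypothetical ~$1B training cost, as assumed in the post

for years in (10, 20):
    print(f"+{years} years: ${projected_cost(cost_today, years):,.0f}")
```

Under those assumptions the two projections come out at about $1M after ten years and about $1k after twenty. Changing `orders_of_magnitude_per_decade` shows how sensitive the estimate is to the assumed rate.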

That’s based solely on the historical trend in the cost of compute. Now, there are companies today designing chips specifically for training and inference of large language models, so the cost of compute for training a new AI might go way down in the very near future.

Add to that the inevitable algorithmic advances over the next 10–20 years…

Then consider that you don’t even need to train your own AI from scratch; you could simply fine-tune an AI on a particular author or set of authors yourself.

Take George R.R. Martin: the first five books in A Song of Ice and Fire have 1,736,054 words, which is probably about 2.6M tokens. If you did four training epochs, that’s about 10.4M total training tokens.

If we assume training costs about 6x the usage cost of a model, and we take the most expensive OpenAI model, gpt-4-32k, at $0.06/1k tokens, we might expect training to cost ~$0.36/1k tokens. With 10.4M total training tokens, that’s about $3750 to fine-tune OpenAI’s most expensive model on the entire text of A Song of Ice and Fire ($83 on gpt-3.5-turbo).
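Here is the same back-of-the-envelope estimate as a short Python sketch, so the inputs are easy to change. The 1.5 tokens-per-word ratio is an assumption implied by the "about 2.6M tokens" figure, the 6x training multiplier is the guess from this post, and `fine_tune_cost` is a made-up helper name, not anything from an OpenAI library.

```python
WORDS = 1_736_054        # word count for the first five books, from the post
TOKENS_PER_WORD = 1.5    # ASSUMPTION: implied by the "about 2.6M tokens" estimate
EPOCHS = 4               # number of training passes assumed in the post


def fine_tune_cost(words: int, epochs: int, price_per_1k_tokens: float,
                   tokens_per_word: float = TOKENS_PER_WORD) -> float:
    """Estimated fine-tuning cost in dollars for a given per-1k-token price."""
    training_tokens = words * tokens_per_word * epochs
    return training_tokens / 1000 * price_per_1k_tokens


# ASSUMPTION from the post: training costs about 6x the $0.06/1k usage price.
gpt4_32k_training_price = 6 * 0.06

print(f"total training tokens: {WORDS * TOKENS_PER_WORD * EPOCHS:,.0f}")
print(f"estimated cost: ${fine_tune_cost(WORDS, EPOCHS, gpt4_32k_training_price):,.0f}")
```

The $83 figure for gpt-3.5-turbo would be consistent with a training price of about $0.008/1k tokens, but the post doesn't state which price was used, so treat that as a guess.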

Even if we imagine his total output is ten times that, and we’re going to do the fine-tuning on 100 equivalent authors, it doesn’t matter… Time always wins.

When anyone in the world can fine-tune a model on an author’s entire body of work for pennies, it becomes rather a moot point.

The simple fact is, a lot of things are going to need to change dramatically throughout all of human society over the next generation. Lawsuits like this might gum up the works for a few years, but in ten or twenty years it’s not going to matter.

It’s definitely heading that way. Which begs the question: “Who will create new content?”. Drawings, pictures, writings can all be easily generated by a prompt. Soon music, even video.

Anyone who releases genuine content will have it immediately gobbled by anyone & everyone for their AI.

Would people like J.R.R. Tolkien or George R. R. Martin bother to write fantasy epics when such epics are constantly being popped out by the thousand with little effort? Knowing that any works they release will be almost immediately used, mutated, and altered for profit?

Or will these authors just be curated Large Language Models?

But there’s also the other side. Large Language Models could help envision massive, epic fantasy worlds and create a whole new dimension of story-telling. So instead of the author simply writing their thoughts one-by-one, they are instead curating something massive and magnificent. Numerous stories inside of a universe. Like, man, imagine instead of following the rigid structure of GoT (where we just follow the written story in front of our eyes) we can actually explore the whole timeline, what every person is doing :raised_hands: that truly would be epic.

Now that I think about this, it sounds creepily similar to the plot of “Westworld”.

It is rather easy to verify bogus claims. And I’ve verified many of the lawsuit claims of Silverman et al to be worthless - while other copyrighted content is easily reproduced (and consider almost everything on the internet has copyright held by someone. I could go after you for making a book of my posts on this forum.)

How? Dump a page of Game of Thrones into the AI. Have it try to complete the next paragraph or even the next word. Ask what actions come after (and consider the AI is very good at inference, so what comes next needs to be unexpected). Know how to not simply get back OpenAI’s own fine-tuning. When AI produces nonsense output, and can’t accomplish any writing or answering tasks that indicate it did any more than read Wikipedia, you can laugh those claims away. Even if it ingested but can’t reproduce, your words are like a water molecule in training what an ocean is, a mote on the semi-truck scales.

Sometimes it’s hard to understand what you are saying. But if I understand correctly you are saying “Even if it’s trained on material X it won’t be able to reproduce verbatim”?

That’s simply not true. You can easily learn (depending on the material) what the model has ingested by feeding it enough context and going with a temperature of 0.

Copied from:

If that’s not what you’re saying, I apologize.

That’s what I say, but you have to comprehend a longer thought to get through it. I try…

  1. It is rather easy to verify bogus claims…other copyrighted content is easily reproduced.
  2. When AI produces nonsense output…you can laugh those claims away.

Ah, see. That’s where I went wrong. I thought the point of good communication is that the point & purpose is immediately and easily comprehensible. But, I’m probably not the one to talk so :person_shrugging:

Where did you find this information? I mean, if you really want to nitpick, sure. However, other sources (like Wikipedia and the Merriam-Webster Dictionary) say that it’s perfectly applicable:

In vernacular English, begging the question (or equivalent rephrasing thereof) often occurs in place of “raises the question”, “invites the question”, “suggests the question”, “leaves unanswered the question” etc… Such preface is then followed with the question, as in:
Begging the question - Wikipedia

Begging the question means “to elicit a specific question as a reaction or response,” and can often be replaced with “a question that begs to be answered.” However, a lesser used and more formal definition is “to ignore a question under the assumption it has already been answered.” The phrase itself comes from a translation of an Aristotelian phrase rendered as “beg the question” but meaning “assume the conclusion.”
Beg (Begging) the Question: What Does it Mean? | Merriam-Webster

Which begs the question: are you a native English speaker?

The “information” was known, a little tweak to the noggin when I read it, and text prompted from ChatGPT.

Tell me about common misuse of the phrase “Which begs the question”, its actual definition, and what would be the correct replacement phrase when people commonly misuse it.

You can also ask about “vernacular English” or what its opposite might be. :upside_down_face:

Ah yes. ChatGPT. Funny. If you tried asking it without bias maybe you would get a non-biased response.

So, if you’re adhering to the traditional definition, the phrase is being used incorrectly in that sentence. However, language evolves, and the phrase is widely understood to mean “raises the question” in casual, modern usage. This is a topic of some debate among language purists, but context often determines whether the traditional or modern interpretation is more appropriate.

So. I’ll leave it there. I’m sorry such a common phrase caused an issue for you. If you’d like to speak Old English with me, go ahead.

P.S. I asked ChatGPT why your post is so confusing and it responded:

Overall, the comment seems to be arguing against the legal or ethical concerns regarding AI’s usage of copyrighted content. However, the way the points are stitched together creates a jumble of perspectives that are not clearly connected, making it hard for the reader to grasp a singular, coherent argument.

So, looks like we both need to improve. I’ll make sure to avoid using a phrase that has a commonly-accepted meaning, and maybe you can focus on being slightly more coherent in your posts.

Points 93–104 of the Authors Guild class action complaint are especially interesting. I think they’re worth a read.

I read them… And then I :joy:.

  93. OpenAI has discussed limited details about the datasets used to “train” GPT-3. OpenAI admits that among the “training” datasets it used to “train” the model were “Common Crawl,” and two “high-quality,” “internet-based books corpora” which it calls “Books1” and “Books2.”

  94. [Case 1:23-cv-08292 Document 1 Filed 09/19/23 Page 13 of 47] Common Crawl is a vast and growing corpus of “raw web page data, metadata extracts, and text extracts” scraped from billions of web pages. It is widely used in “training” LLMs, and has been used to “train,” in addition to GPT-N, Meta’s LlaMa, and Google’s BERT. It is known to contain text from books copied from pirate sites.

  95. OpenAI refuses to discuss the source or sources of the Books2 dataset.

This is not damning in and of itself.

  96. Some independent AI researchers suspect that Books2 contains or consists of ebook files downloaded from large pirate book repositories such as Library Genesis or “LibGen,” “which offers a vast repository of pirated text.”

This is pure speculation. It doesn’t really matter what “some independent AI researchers suspect.”

  97. LibGen is already known to this Court as a notorious copyright infringer.

Only possibly relevant if LibGen is the source. LibGen also hosted many non-copyrighted works.

  98. Other possible candidates for Books2’s sources include Z-Library, another large pirate book repository that hosts more than 11 million books, and pirate torrent trackers like Bibliotik, which allow users to download ebooks in bulk.

Now they’re just throwing stuff at the wall. Other possible candidates for Books2’s sources include aliens and the Loch Ness Monster.

  99. Websites linked to Z-Library appear in the Common Crawl corpus and have been included in the “training” dataset of other LLMs.

I’m not really sure why this is relevant; they already tried to implicate Common Crawl in item 94.

  100. Z-Library’s Internet domains were seized by the FBI in February 2022, only months after OpenAI stopped “training” GPT-3.5 in September 2021.

Only possibly relevant if Z-Library is the source. Even then the timing and even existence of the FBI seizure of the domain names is not relevant.

  101. The disclosed size of the Books2 dataset (55 billion “tokens,” the basic units of textual meaning such as words, syllables, numbers, and punctuation marks) suggests it comprises over 100,000 books.

The size of Books2 is not directly relevant.

  102. “Books3,” a dataset compiled by an independent AI researcher, is comprised of nearly 200,000 books downloaded from Bibliotik, and has been used by other AI developers to “train” LLMs.

Books3 is not relevant at all; OpenAI didn’t train on Books3.

  103. The similarities in the sizes of Books2 and Books3, and the fact that there are only a few pirate repositories on the Internet that allow bulk ebook downloads, strongly indicates that the books contained in Books2 were also obtained from one of the notorious repositories discussed above.

This is pure nonsense. First, the relative size of Books2 and Books3 means nothing in terms of the actual contents of the two datasets. They’re assuming—without evidence—Books2 is the result of bulk downloading of books from a “pirate repository.” And they’re magically linking Books2 and Books3.

I also find it really weird that they keep putting “training” in quotes. It would be like if I kept referencing the “writers’” “writing” because they “wrote” their “writings” on a computer instead of using pen and paper.

In short, a complaint is “written” by one side to tell the narrative they want to portray. They can literally “write” anything they want in it. Complaints typically aren’t interesting unless they have facts to back up their claims.

Civil suit complaints are notoriously bad about this.


I agree with (most of) your comments.

You could look at it from the other side: considering that AI is potentially the most powerful technology ever created, you could argue that folding AI into the library laws would be the simplest thing to do. That way AI can give “the reader” access to book content on demand.

@Foxabilo : What does “folding AI into the library laws” mean?

Adding AI to those laws such that AI becomes a library.

I still don’t understand. What does “adding AI to a law” mean? And what does “AI becoming a library” mean? If you assume that this should be clear from the context: for me it is not.

In most countries a library operates as a centralised knowledge distribution hub. I am suggesting that AIs could be added to the legal definition of a library.
