Some questions on copyrighted material

Ah, see. That’s where I went wrong. I thought the point of good communication is that the point & purpose is immediately and easily comprehensible. But, I’m probably not the one to talk so :person_shrugging:

Where did you find this information? I mean, if you really want to nitpick, sure. However, other sources (like Wikipedia and the Merriam-Webster Dictionary) say that it’s perfectly applicable:

In vernacular English,[27][28][29][30] begging the question (or equivalent rephrasing thereof) often occurs in place of “raises the question”, “invites the question”, “suggests the question”, “leaves unanswered the question” etc… Such preface is then followed with the question, as in:[31][32]
Begging the question - Wikipedia

Begging the question means “to elicit a specific question as a reaction or response,” and can often be replaced with “a question that begs to be answered.” However, a lesser used and more formal definition is “to ignore a question under the assumption it has already been answered.” The phrase itself comes from a translation of an Aristotelian phrase rendered as “beg the question” but meaning “assume the conclusion.”
Beg (Begging) the Question: What Does it Mean? | Merriam-Webster

Which begs the question: are you a native English speaker?

The “information” was known, a little tweak to the noggin when I read it, and text prompted from ChatGPT.

Tell me about common misuse of the phrase “Which begs the question”, its actual definition, and what would be the correct replacement phrase when people commonly misuse it.

You can also ask about “vernacular English” or what its opposite might be. :upside_down_face:

Ah yes. ChatGPT. Funny. If you tried asking it without bias maybe you would get a non-biased response.

So, if you’re adhering to the traditional definition, the phrase is being used incorrectly in that sentence. However, language evolves, and the phrase is widely understood to mean “raises the question” in casual, modern usage. This is a topic of some debate among language purists, but context often determines whether the traditional or modern interpretation is more appropriate.

So. I’ll leave it there. I’m sorry such a common phrase caused an issue for you. If you’d like to speak Old English with me, go ahead.

P.S. I asked ChatGPT why your post is so confusing and it responded:

Overall, the comment seems to be arguing against the legal or ethical concerns regarding AI’s usage of copyrighted content. However, the way the points are stitched together creates a jumble of perspectives that are not clearly connected, making it hard for the reader to grasp a singular, coherent argument.

So, looks like we both need to improve. I’ll make sure to avoid using a phrase that has a commonly-accepted meaning, and maybe you can focus on being slightly more coherent in your posts.

Especially interesting are points 93-104 in the Authors Guild class action complaint. I think they’re worth reading.

I read them… And then I :joy:.

  93. OpenAI has discussed limited details about the datasets used to “train” GPT-3. OpenAI admits that among the “training” datasets it used to “train” the model were “Common Crawl,” and two “high-quality,” “internet-based books corpora” which it calls “Books1” and “Books2.”

  94. Common Crawl is a vast and growing corpus of “raw web page data, metadata extracts, and text extracts” scraped from billions of web pages. It is widely used in “training” LLMs, and has been used to “train,” in addition to GPT-N, Meta’s LlaMa, and Google’s BERT. It is known to contain text from books copied from pirate sites. [Case 1:23-cv-08292, Document 1, filed 09/19/23, page 13 of 47]

  95. OpenAI refuses to discuss the source or sources of the Books2 dataset.

This is not damning in and of itself.

  96. Some independent AI researchers suspect that Books2 contains or consists of ebook files downloaded from large pirate book repositories such as Library Genesis or “LibGen,” “which offers a vast repository of pirated text.”

This is pure speculation. It doesn’t really matter what “some independent AI researchers suspect.”

  97. LibGen is already known to this Court as a notorious copyright infringer.

Only possibly relevant if LibGen is the source. LibGen also hosted many non-copyrighted works.

  98. Other possible candidates for Books2’s sources include Z-Library, another large pirate book repository that hosts more than 11 million books, and pirate torrent trackers like Bibliotik, which allow users to download ebooks in bulk.

Now they’re just throwing stuff at the wall. Other possible candidates for Books2’s sources include aliens and the Loch Ness Monster.

  99. Websites linked to Z-Library appear in the Common Crawl corpus and have been included in the “training” dataset of other LLMs.

I’m not really sure why this is relevant; they already tried to implicate Common Crawl in item 94.

  100. Z-Library’s Internet domains were seized by the FBI in February 2022, only months after OpenAI stopped “training” GPT-3.5 in September 2021.

Only possibly relevant if Z-Library is the source. Even then, neither the timing nor the existence of the FBI seizure of the domain names is relevant.

  101. The disclosed size of the Books2 dataset (55 billion “tokens,” the basic units of textual meaning such as words, syllables, numbers, and punctuation marks) suggests it comprises over 100,000 books.

The size of Books2 is not directly relevant.
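To put the quoted figure in perspective, here is a rough back-of-the-envelope sketch. The 55 billion tokens and the "over 100,000 books" come from the complaint as quoted above; the average book lengths in the loop are purely hypothetical values chosen for illustration, not figures from the complaint or from OpenAI.

```python
# Figure quoted in the complaint (item 101).
BOOKS2_TOKENS = 55_000_000_000


def implied_tokens_per_book(total_tokens: int, n_books: int) -> float:
    """Average tokens per book if the dataset held exactly n_books."""
    return total_tokens / n_books


def implied_book_count(total_tokens: int, avg_tokens_per_book: int) -> int:
    """Number of whole books implied by an assumed average book length."""
    return total_tokens // avg_tokens_per_book


if __name__ == "__main__":
    # If Books2 held exactly 100,000 books, each would average this many tokens.
    print(implied_tokens_per_book(BOOKS2_TOKENS, 100_000))

    # Hypothetical average book lengths (assumptions, not from the source).
    for avg in (50_000, 100_000, 200_000):
        print(avg, implied_book_count(BOOKS2_TOKENS, avg))
```

Depending on the assumed average length, the same token count maps to very different book counts, which is one way to see why the dataset size alone says little about where the books came from.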

  102. “Books3,” a dataset compiled by an independent AI researcher, is comprised of nearly 200,000 books downloaded from Bibliotik, and has been used by other AI developers to “train” LLMs.

Books3 is not relevant at all; OpenAI didn’t train on Books3.

  103. The similarities in the sizes of Books2 and Books3, and the fact that there are only a few pirate repositories on the Internet that allow bulk ebook downloads, strongly indicates that the books contained in Books2 were also obtained from one of the notorious repositories discussed above.

This is pure nonsense. First, the relative size of Books2 and Books3 means nothing in terms of the actual contents of the two datasets. They’re assuming—without evidence—Books2 is the result of bulk downloading of books from a “pirate repository.” And they’re magically linking Books2 and Books3.

I also find it really weird that they keep putting “training” in quotes. It would be like if I kept referencing the “writers’” “writing” because they “wrote” their “writings” on a computer instead of using pen and paper.

In short, a complaint is “written” by one side to tell the narrative they want to portray. They can literally “write” anything they want in it. Complaints typically aren’t interesting unless they have facts to back up their claims.

Civil suit complaints are notoriously bad about this.


I agree with (most of) your comments.

You could look at it from the other side: considering that AI is potentially the most powerful technology ever created, you could argue that folding AI into the library laws would be the simplest thing to do. That way AI can give “the reader” access to book content on demand.

@Foxalabs : What does “folding AI into the library laws” mean?

Adding AI to those laws such that AI becomes a library.

I still don’t understand. What does “adding AI to a law” mean? And what does “AI becoming a library” mean? If you assume that this should be clear from the context: for me it is not.

In most countries a library operates as a centralised knowledge distribution hub. I am suggesting that AIs could be added to the legal definition of a library.

This sounds interesting!

Fun fact: The OpenAI forum chatbot doesn’t allow me to send the preceding sentence alone:

An error occurred: Body seems unclear, is it a complete sentence?

The same error message for “What you say sounds interesting to me.” (What is unclear or incomplete about it?)

GPT-4 prohibits opinions and normal human conversation;-)
(What I wanted to say: Now I understand you.)

The forum requires that people create replies that are more than short one-liners. It can get a little frustrating, but it is what it is.

I think it’s a rather arbitrary and haphazard rule - but as you say: it is what it is.

ChatGPT refuses to cite from copyrighted material:

ChatGPT generates content in the style and with the characters of a copyrighted book:

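“Mr und Mrs Dursley im Ligusterweg Nummer 4 waren stolz darauf, ganz und gar normal zu sein, sehr stolz sogar.”
(In English, roughly: “Mr and Mrs Dursley of number 4, Privet Drive were proud of being thoroughly normal, very proud indeed.”)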

I see what you mean! Thanks for checking it out!

Maybe so, but I think we are in a time where lots of people are antsy about this new technology. I just scraped a site of 900 posts consisting of legal articles and case law. Here is what the website posts:

Readers do not have to request permission to reprint items, however all reprinted items must bear one of the two following attributions:

If your reprint is electronic, as follows, keeping the link intact:
Reprinted from blah, blah.

Of course, every post I uploaded has the required citation.

Now, when they wrote this (probably 10 years or more ago), they had no idea a day would come when someone would not only download every article posted, but feed that into a computer to help generate answers to questions.

To be clear, I have no interest whatsoever in reproducing this information for publication. I am not their competitor. I only use it as part of my “Deepening Comprehension through Complementary Content” strategy I discussed here: How to Fine-Tune without Fine-Tuning -- Or, How to Make your RAG Implementation Smarter

To that end, whenever a citation is returned in a query that references their content, the associated link goes to their website, not mine. I don’t know how much more transparent I could be.
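For anyone wondering what that kind of citation handling might look like in practice, here is a minimal sketch. It is not the setup described above; the `Chunk` class, the function names, the placeholder attribution wording, and the example.com URLs are all hypothetical and exist only for illustration. The idea is simply that every stored chunk carries the URL of the original page, so any citation shown to the user links back to the publisher.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    """A retrieved text chunk plus where it originally came from (hypothetical)."""
    text: str
    source_title: str
    source_url: str  # link to the original publisher's page, not a local copy


def format_citation(chunk: Chunk) -> str:
    # Placeholder wording; the real attribution text is whatever the site requires.
    return f"Reprinted from {chunk.source_title}: {chunk.source_url}"


def answer_with_citations(answer: str, chunks: list[Chunk]) -> str:
    """Append one citation per distinct source URL, in retrieval order."""
    seen: set[str] = set()
    lines = [answer, "", "Sources:"]
    for chunk in chunks:
        if chunk.source_url in seen:
            continue  # avoid citing the same page twice
        seen.add(chunk.source_url)
        lines.append(f"- {format_citation(chunk)}")
    return "\n".join(lines)


if __name__ == "__main__":
    retrieved = [
        Chunk("First passage.", "Example Article A", "https://example.com/a"),
        Chunk("Second passage.", "Example Article A", "https://example.com/a"),
        Chunk("Third passage.", "Example Article B", "https://example.com/b"),
    ]
    print(answer_with_citations("Generated answer text.", retrieved))
```

Keeping the source URL on the chunk itself, rather than looking it up later, means the attribution travels with the content through retrieval.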

But, how much do you want to bet I’m going to be hearing from them when they find out? How do you think they are going to react, even though I have completely complied with their terms of use?

So, yeah, I think we’re going to see all kinds of people coming out of the woodwork – especially lawyers. Nonsense or not. When has that ever stopped them?

Came across this interesting series of articles in The Atlantic:

The author went ahead and investigated this issue and is making some valuable, fair points.
What has stood out to me are the points made about the big tech companies:

  • they admit that they used datasets containing large amounts of copyrighted books,
  • there is a consensus in the developer community that these books have high value for LLM training, and
  • this type of piracy by large companies differs from earlier cases, where consumers pirated copyrighted material for personal use rather than for monetary gain.

As a heavy user, developer and full-blown enthusiast of AI I cannot simply dismiss these arguments.

Here is the link to the author’s profile: