I know the whole issue of “transformative use” is being debated with respect to the NY Times vs. OpenAI lawsuit, but I have a question that may or may not be related, and I’m wondering whether it can be answered under current copyright law.
Let’s say I purchase a copy of Stephen King’s latest novel. I cut it up, scan it, chunk the text and create an embedding of it in my vector store. Next, I create a website “Get Answers to Questions about Stephen King’s Latest Novel”, where people can post their questions, and I answer them.
After a while, I start to use the embeddings from my vector store to find the answers.
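To make the setup concrete, here is a minimal sketch of the chunk, embed, and search steps described above. It is an illustration under stated assumptions, not how any particular vector store works: the `embed` function is a toy bag-of-words stand-in for a real embedding model, and `chunk_text`, `build_store`, and `search` are hypothetical helper names.

```python
import math
import re
from collections import Counter


def chunk_text(text, chunk_size=40, overlap=10):
    """Split text into overlapping chunks of at most chunk_size words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunk = words[start:start + chunk_size]
        if chunk:
            chunks.append(" ".join(chunk))
        if start + chunk_size >= len(words):
            break
    return chunks


def embed(text):
    """Toy embedding (assumption): word counts instead of a learned vector."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def build_store(text, chunk_size=40, overlap=10):
    """The 'vector store' here is just a list of (chunk, vector) pairs."""
    return [(chunk, embed(chunk)) for chunk in chunk_text(text, chunk_size, overlap)]


def search(store, question, top_k=1):
    """Return the top_k chunks most similar to the question."""
    query = embed(question)
    ranked = sorted(store, key=lambda item: cosine(query, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:top_k]]
```

Note that in this sketch the store keeps the original chunk text alongside each vector, which is typical for retrieval setups and is the part that matters for the copying question: the search step hands back verbatim passages, not just numbers.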
So far, so good. I don’t think I’ve violated any copyright rules.
Next, I decide to use an LLM to answer the questions directly. So instead of me taking the question, submitting it to the vector store and rendering an answer, I now let the LLM do this. It does not return any of Mr. King’s novel text, just its responses, which may or may not contain excerpts (depending upon the question).
My question is, where have I violated existing copyright law in either case?
Now, do I believe Mr. King should receive some sort of credit/royalty/compensation? Yes, of course. But, am I in violation of Mr. King’s copyright on the novel?
I think you’d technically be in violation in either case, and definitely in the second case. My understanding is that it’s a derivative work that would require an express license from the original author.
In some recent Playground experiments, I found that gpt-4-1106-preview itself will now refuse to complete responses containing copyrighted material, but fully allows copyrighted inputs.
So in ops, your output would essentially be truncated prematurely.
But like @Diet said above, you could be on the hook for the input. The output may fall more on OAI’s shoulders, especially since the models are currently censoring it. But could you be liable for the output if you also fed it copyrighted input?
The laws are not 100% clear to me, or anybody else, given the recent lawsuits flying around on this topic of AI and copyright law.
But if you have a high-profile system, I would steer clear of anything copyrighted until the law is settled.
This is not legal advice, just a random internet stranger’s musings. It is copyright infringement if a judge says it is. Whether a judge says a particular use is an infringement will depend on the facts, changes to the law, and the court and counsel’s understanding of technology.
I’m of the opinion that your hypothetical case is not infringement, and I get there by analogy.
The following are not infringement:
written analysis of a book
a review of a book
doing a Google search for a phrase in the book
writing a point by point plot synopsis of a book
hitting ctrl-F on the document
using a TTS screen reader to read a book
search engine indexers of content on the web
archive copies of websites (and books, for that matter).
Copyright, generally speaking, asks the question, “Does the alleged infringement replace the need to purchase the source material?” I don’t think the answer is yes here.
I only do copyright stuff incidentally, so I’m not a copyright expert. I am a high tech attorney, so I’m well versed in questions of the underlying software. I think the current copyright laws aren’t really structured to address the AI landscape, but as they stand, I think the laws that apply to search engines are a good proxy.
The Copyright Act allows anyone to photocopy copyrighted works without securing permission from the copyright owner when the photocopying amounts to a “fair use” of the material (17 U.S.C. § 107). The following guidelines describe the boundaries of fair use of photocopied material used in research or the classroom or in a library reserve operation. Fair use cannot always be expressed in numbers – either the number of pages copied or the number of copies distributed. Therefore, an instructor should weigh the various factors listed in the Act and judge whether the intended use of photocopied, copyrighted material is within the spirit of the fair use doctrine. Any serious questions concerning whether a particular photocopying constitutes fair use should be directed to College counsel.
Which goes on to say:
At the very least, instructors may make a single copy of the following for scholarly research or use in teaching or preparing to teach a class:
A chapter from a book
An article from a periodical or newspaper
A short story, short essay or short poem, whether or not from a collective work
A chart, diagram, graph, drawing, cartoon or picture from a book, periodical or newspaper
These examples reflect the most conservative guidelines for fair use. They do not represent inviolate ceilings for the amount of copyrighted material which can be photocopied within the boundaries of fair use. When exceeding these minimum levels, however, you should consider the four factors listed in Section 107 of the Copyright Act to make sure that any additional photocopying is justified. The following demonstrate situations where increased levels of photocopying would remain within the range of fair use:
The inability to obtain another copy of the work because it is not available from another library or a source cannot be obtained within your time constraints;
The intention to photocopy the material only once and not to distribute the material to others;
The ability to keep the amount of material photocopied within a proportion reasonable to the entire work (the larger the work, the greater amount of material which may be photocopied).
Most single-copy photocopying for your personal use in research – even when it involves a substantial portion of a work – may well constitute fair use.
I like your argument from analogy. I’m a lawyer too and have done copyright work in the distant past.
I’m thinking, “Which of the acts that qualify as infringement under the Copyright Act of 1976 is SomebodySysop engaging in?” Copying seems to be perhaps the only one, because, say, an ephemeral copy gets made in the computer’s memory. And the search indexing decisions rule out that act as infringement.
The second, non-legal aspect of “does this reduce demand for the source material” is the real world question of, “Are you cutting into the author’s profits?”
OP has a very cool idea, and it doesn’t seem like King or his publisher are doing anything similar.
Even setting aside the ephemeral nature of a copy, the Second Circuit in New York found in favor of Google in an almost perfect analogy to OP’s hypo.
Hopefully courts can see the similarities with AI embeddings.
LEVAL, Circuit Judge:
This copyright dispute tests the boundaries of fair use. Plaintiffs, who are authors of published books under copyright, sued Google, Inc. (“Google”) for copyright infringement in the United States District Court for the Southern District of New York (Chin, J.). They appeal from the grant of summary judgment in Google’s favor. Through its Library Project and its Google Books project, acting without permission of rights holders, Google has made digital copies of tens of millions of books, including Plaintiffs’, that were submitted to it for that purpose by major libraries. Google has scanned the digital copies and established a publicly available search function. An Internet user can use this function to search without charge to determine whether the book contains a specified word or term and also see “snippets” of text containing the searched-for terms. In addition, Google has allowed the participating libraries to download and retain digital copies of the books they submit, under agreements which commit the libraries not to use their digital copies in violation of the copyright laws. These activities of Google are alleged to constitute infringement of Plaintiffs’ copyrights. Plaintiffs sought injunctive and declaratory relief as well as damages.
Google defended on the ground that its actions constitute “fair use,” which, under 17 U.S.C. § 107, is “not an infringement.” The district court agreed. Authors Guild, Inc. v. Google Inc., 954 F.Supp.2d 282, 294 (S.D.N.Y.2013). Plaintiffs brought this appeal.
Plaintiffs contend the district court’s ruling was flawed in several respects. They argue: (1) Google’s digital copying of entire books, allowing users through the snippet function to read portions, is not a “transformative use” within the meaning of Campbell v. Acuff–Rose Music, Inc., 510 U.S. 569, 578–585, 114 S.Ct. 1164, 127 L.Ed.2d 500 (1994), and provides a substitute for Plaintiffs’ works; (2) notwithstanding that Google provides public access to the search and snippet functions without charge and without advertising, its ultimate commercial profit motivation and its derivation of revenue from its dominance of the world-wide Internet search market to which the books project contributes, preclude a finding of fair use; (3) even if Google’s copying and revelations of text do not infringe plaintiffs’ books, they infringe Plaintiffs’ derivative rights in search functions, depriving Plaintiffs of revenues or other benefits they would gain from licensed search markets; (4) Google’s storage of digital copies exposes Plaintiffs to the risk that hackers will make their books freely (or cheaply) available on the Internet, destroying the value of their copyrights; and (5) Google’s distribution of digital copies to participant libraries is not a transformative use, and it subjects Plaintiffs to the risk of loss of copyright revenues through access allowed by libraries. We reject these arguments and conclude that the district court correctly sustained Google’s fair use defense.
Google’s making of a digital copy to provide a search function is a transformative use, which augments public knowledge by making available information about Plaintiffs’ books without providing the public with a substantial substitute for matter protected by the Plaintiffs’ copyright interests in the original works or derivatives of them. The same is true, at least under present conditions, of Google’s provision of the snippet function. Plaintiffs’ contention that Google has usurped their opportunity to access paid and unpaid licensing markets for substantially the same functions that Google provides fails, in part because the licensing markets in fact involve very different functions than those that Google provides, and in part because an author’s derivative rights do not include an exclusive right to supply information (of the sort provided by Google) about her works. Google’s profit motivation does not in these circumstances justify denial of fair use. Google’s program does not, at this time and on the record before us, expose Plaintiffs to an unreasonable risk of loss of copyright value through incursions of hackers. Finally, Google’s provision of digital copies to participating libraries, authorizing them to make non-infringing uses, is non-infringing, and the mere speculative possibility that the libraries might allow use of their copies in an infringing manner does not make Google a contributory infringer. Plaintiffs have failed to show a material issue of fact in dispute.
I don’t think so – the question is, how is the answer rendered?
You can have a podcast that discusses a book, and it doesn’t require any particular kind of credit or royalties to the book author.
In copyright law, “derivative work” has a specific meaning, and I personally don’t think the described use case rises to that meaning.
But, it largely depends on how the responses are generated.
If the embedding matches end up mapping to excerpts from the book, and the book text is provided back to the model as part of the prompt to answer each question, then there’s more of a question – if you bought the book in electronic form without DRM, you may be allowed to do this, maybe? But if you scanned a paper book, the act of transferring that scanned text into the computer for serving, might be infringing – it all depends.
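As a rough sketch of why “how the responses are generated” matters technically: when retrieved excerpts are pasted into the prompt, verbatim book text flows into the model, and some of it may come back out. The helpers below are hypothetical illustrations (`build_prompt` and `longest_shared_run` are names made up here), and the word-run measure has no legal significance; it only makes visible how much of an answer is copied word for word.

```python
def build_prompt(question, excerpts):
    """Assemble a prompt that places retrieved excerpts in front of the question."""
    context = "\n\n".join(
        f"[Excerpt {i}]\n{excerpt}" for i, excerpt in enumerate(excerpts, 1)
    )
    return (
        "Answer the question using only the excerpts below. "
        "Do not quote them verbatim.\n\n"
        f"{context}\n\nQuestion: {question}\nAnswer:"
    )


def longest_shared_run(answer, source):
    """Length, in words, of the longest word sequence found in both texts."""
    a = answer.lower().split()
    s = source.lower().split()
    best = 0
    prev = [0] * (len(s) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(s) + 1)
        for j in range(1, len(s) + 1):
            if a[i - 1] == s[j - 1]:
                # Extend the run that ended at the previous word pair.
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best
```

A site operator could run something like `longest_shared_run` on each answer before returning it and flag long runs for review, but where to draw that line is exactly the kind of question the courts and counsel would have to answer.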
Which is why, really, we should let the courts figure this out, and get high quality lawyers to interpret our specific use cases, rather than speculating with internet randos such as myself.
And with this you’d probably be okay, but Stephen King—and more importantly Stephen King’s publishers—have pockets much deeper than you do (presumably).
One of the most important things to know about copyright law is that, at the end of the day, whether something is or is not a violation is settled in court, and the presumptive rights given to the author of a work are very strong.
OpenAI is being sued by the NYT for copyright infringement for using its works as training data, which is a layer of abstraction well beyond embeddings.
Creating a bot with direct access to the full text of an author’s work may (or may not) be seen as a derivative work by a court. Whether or not it is, I can say with near certainty that you wouldn’t want to be the test case to figure it out.