Sagan's Blue Dot bug: at least two models refuse to continue the famous quote

Hmm. I’m really confused how you don’t understand this issue.

It’s possible that it’s pointless arguing with you :confused:

Sorry I couldn’t reach ya.

Allow me to inject myself for a moment, as there seems to be some misunderstanding here. @anon22939549 is correct that this “content filtering” is intended behavior, as OpenAI are trying to protect themselves from further frivolous lawsuits.

I fully agree with @_j that there should be some sort of “copyright portal” where rights holders can submit complaints or make various other claims regarding the use of their content.

From my point of view these opinions aren’t conflicting.


Copyright Law of the United States (Title 17)

A “computer program” is a set of statements or instructions to be used directly or indirectly in a computer in order to bring about a certain result.

and…

(b)(1)(A) Notwithstanding the provisions of subsection (a), unless authorized by the owners of copyright in the sound recording or the owner of copyright in a computer program (including any tape, disk, or other medium embodying such program), and in the case of a sound recording in the musical works embodied therein, neither the owner of a particular phonorecord nor any person in possession of a particular copy of a computer program (including any tape, disk, or other medium embodying such program), may, for the purposes of direct or indirect commercial advantage, dispose of, or authorize the disposal of, the possession of that phonorecord or computer program (including any tape, disk, or other medium embodying such program) by rental, lease, or lending, or by any other act or practice in the nature of rental, lease, or lending. Nothing in the preceding sentence shall apply to the rental, lease, or lending of a phonorecord for nonprofit purposes by a nonprofit library or nonprofit educational institution. The transfer of possession of a lawfully made copy of a computer program by a nonprofit educational institution to another nonprofit educational institution or to faculty, staff, and students does not constitute rental, lease, or lending for direct or indirect commercial purposes under this subsection.

(4) Any person who distributes a phonorecord or a copy of a computer program (including any tape, disk, or other medium embodying such program) in violation of paragraph (1) is an infringer of copyright under section 501 of this title and is subject to the remedies set forth in sections 502, 503, 504, and 505. Such violation shall not be a criminal offense under section 506 or cause such person to be subject to the criminal penalties set forth in section 2319 of title 18.

Does this bring about a certain result when used directly in a computer?

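# Read the excerpt of the law from a local text file.
with open('law_excerpt.txt', 'r', encoding='utf-8') as file:
    copyright_law = file.read()

user_name = "elmstedt"  # assumed value; the original snippet never defines user_name

if user_name == "elmstedt":
    print(copyright_law)
else:
    print('\U0001F44D')  # thumbs-up emoji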

they were probably thinking about a misconception about patents and confused it with copyright

but I don’t think any of this is about copyright. it’s about front loading that particular content filter instead of making it part of the moderation endpoint.

Jay, what are you on about. That’s not what I’m referring to and you know it.

Please stop spamming irrelevant information in the thread.

I literally DO NOT UNDERSTAND what you are trying to say.

Calm down guys, y’all charging in different directions :rofl:

  1. GPT instructions are more of a creative work than other types of code; one may argue that they bring about the production of behavior in a computer system, and that copying them, especially to gain commercial advantage on a compensated platform such as the GPT store, is an infringement.
  2. OpenAI has demonstrated the technology for fingerprinting and identification of works that could be applied to such;
  3. It is heartbreakingly against the spirit of Sagan’s inspiring message to see the words denied at the first token.

Just checking: are you a developer, integrating LLMs into products? What’s your background?

I’m just wondering what your PoV is on this, or where it’s coming from.

Please, just explain what idea you are attempting to convey.

I think I can safely say that everyone here has some experience developing stuff using OpenAI’s products, @anon22939549 & myself are usually focused on the more academic side of things :laughing:

content filter on raw model output bad. big trouble now and in future.

content filter belongs in moderation endpoint.

copyright arguments seem to be a sideshow/distraction. (my opinion, I don’t get what you’re really on about - but I could be missing something)

I don’t want to put anyone down but there seems to be a real disconnect in perspectives here, and I’m wondering why.


This is only useful if people use the moderation endpoint.

OpenAI must prohibit these types of outputs or they will face neverending lawsuits from now until the end of time.

Then we all get nothing.


I think some of this is down to a lot of context and meaning being lost when only communicating via text :sweat_smile:

I think we can all agree on this though:

I’ve tried researching whether that particular quote is copyrighted, and the results were not conclusive, although most sources say it is, while discussing why it should be fair use.

As it stands you’ll unfortunately need written permission from Sagan’s estate, to make sure you’re good legally speaking. @elm is correct that:

But we can hope that this changes once the NYT lawsuit is settled. As of now, my best advice is to just paraphrase the quote :+1:

At the end of the day it doesn’t matter. Since OpenAI cannot be expected to know the copyright status of everything in the training corpus it makes sense to err on the side of caution and have a blanket prohibition on the verbatim recitation of training data.


This is really puzzling to me. The product will be neutered to oblivion and then we will still have nothing.

That’s not entirely true.

Indemnity

If you are a business or organization, to the extent permitted by law, you will indemnify and hold harmless us, our affiliates, and our personnel, from and against any costs, losses, liabilities, and expenses (including attorneys’ fees) from third party claims arising out of or relating to your use of the Services and Content or any violation of these Terms. https://openai.com/policies/terms-of-use

The paradigm you support will leave us with nothing in the long run.

It’s probably equally bad if OpenAI starts suing its developers (if any developers are actually in that position). It’s not an easy thing, but just putting everyone in a padded room is not a solution.

This neutering has started way before this NYT thing was a thing. I’d say it started after they released chatgpt. Remember 0613?

Very true.

I think the topic is interesting from an academic perspective though, because I would assume that some of the most repeated quotes in the training data are also the ones that are generally old enough to be considered fair use :thinking:

My task is to format a lot of audio transcripts into readable text. So far, about 30% of the transcripts trigger the problem, because the speakers quote some famous stuff, or perhaps because the speakers were quoted in newspapers or something.

The problem seems to be very frequent, and it will affect a lot of developers.

If the issue is indeed about regurgitating copyrighted texts from the training data, I would propose the following (mutually compatible) solutions:

  • In the API, don’t just silently truncate the output. Instead, throw an error.
  • Allow developers to opt out of the copyright filtering. If necessary, make them sign some agreement about it (“You should not use the API to infringe on copyright” etc).
  • Don’t trigger the “training data regurgitation alarm” if the user himself provided the text (as in the “Blue dot” case). In my case, it is clear that the user himself provided the text. It is not from the training data. Please allow me to work with the text I myself have provided.
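Until something like that exists, the first and third points can be roughly approximated on the client side. Below is a minimal sketch, not an official API: the response is modeled as a plain dict, `TruncatedOutputError`, `extract_text` and `user_supplied` are hypothetical names, and treating a `finish_reason` other than `"stop"` as the sign of a filtered or truncated output is my assumption about how the truncation shows up.

```python
# Hypothetical client-side sketch: turn a silently truncated completion into
# an explicit error. The dict layout and the meaning given to finish_reason
# are assumptions, not confirmed behavior of the filter discussed here.


class TruncatedOutputError(RuntimeError):
    """Raised when a completion ended for a reason other than a normal stop."""


def extract_text(response: dict) -> str:
    """Return the completion text, or raise if the output did not end normally."""
    choice = response["choices"][0]
    reason = choice.get("finish_reason")
    text = choice["message"]["content"] or ""
    if reason != "stop":
        raise TruncatedOutputError(
            f"completion ended with finish_reason={reason!r} "
            f"after {len(text)} characters"
        )
    return text


def _normalize(s: str) -> str:
    """Collapse whitespace and lowercase, so line breaks do not hide a match."""
    return " ".join(s.split()).lower()


def user_supplied(passage: str, user_input: str) -> bool:
    """True if the passage appears verbatim (modulo whitespace/case) in the user's own input."""
    return _normalize(passage) in _normalize(user_input)
```

With this, a transcript job can at least log which inputs were cut off and retry or skip them, instead of silently storing half a transcript. `user_supplied` only shows the kind of check the third bullet asks for; it cannot change what the server decides to filter.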

Yup, it’s a PITA. They are under threat of lawsuit so nothing else matters. This copyright filter may just result in landmines from trying to generate articles and/or quoting sources.

Most of the time they duct-tape a solution while they contemplate an actual fix. Most of the time.

Not going to happen. They are ultimately responsible.

OpenAI’s current argument (paraphrased, and besides saying that they depend on the sources they ripped content from) is that the plaintiff “abused/misused” their model to spit out the copyrighted material verbatim, and that correct usage wouldn’t allow for it. So, they aren’t going to do this either.

Two greedy giants fight each other and the little people ultimately lose. That’s the way it works :person_shrugging:
