Sagan's Blue Dot bug: at least two models refuse to continue the famous quote

Short description

If you ask the “gpt-4” or “gpt-4-1106-preview” models to recite the famous “Blue Dot” quote by Carl Sagan, they fail to do so. Instead, the model almost always cuts the quote off at one particular place:

“everyone you love, everyone you know, everyone you’ve”

This happens almost every time. It looks like some kind of repetition detector is being too stringent.

Steps to reproduce

Run the code below.

The expected result: the code prints the full quote, ending with the word “civilization”.

The observed result: it ends in the middle of the quote, with “everyone you’ve”.

import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def format_text(raw_text):
    text_for_gpt = f"Please format this text: {raw_text}"

    completion = client.chat.completions.create(
        model="gpt-4-1106-preview", 
        messages=[
            {"role": "system", "content": "You are a helpful assistent."},
            {"role": "user", "content": text_for_gpt},
        ],
    )
    return str(completion.choices[0].message.content)


blue_dot_text = """
    Carl Sagan said: Look again at that dot.

    That's here, that's home, that's us.

    On it, everyone you love, everyone you know,

    everyone you've ever heard of,

    every human being who ever was

    lived out their lives.

    the aggregate of our joy and suffering,

    thousands of confident religions,

    ideologies and economic doctrines,

    every hunter and forager, every hero

    and coward, every creator and destroyer of civilization
    """

result = format_text(blue_dot_text)
print(result)

Sample erroneous output:

BTW, it’s not only the API. GPT-4 in the default web interface stops at roughly the same spot and complains about policy violations.

1 Like

You’re trying to get the model to reproduce content which is protected by copyright.

This is expected and correct behaviour from the model.

1 Like

This is the first time I’ve observed such behavior, even though I’ve used the API on all kinds of data.

Besides, the AI has no realistic way to know whether a given text is copyrighted, because copyright laws differ between countries and apply in different ways to different texts, and because it’s very tricky to check whether an edited text is still covered by someone’s copyright.

It seems the issue is specifically with the repetitive part:

everyone you love, everyone you know, everyone you've ever heard of

A few weeks ago people were having fun by asking the AI to repeat the same word many times, which caused it to behave in strange ways.

I suspect this bug could be a result of some “repetitive output detector” that was designed to combat that.

1 Like

This is not a bug.

This is a protective measure put in place to prevent verbatim regurgitation of training data.

You are trying to get the model to recite something which is protected by copyright, this is not allowed and is against the terms of service.

The model’s behaviour is expected.

3 Likes

It is indeed a bug, because the expected behavior is for the model to properly format the text, as requested, without cutting it off in the middle.

The model has no ability to decide which user input is protected by copyright and which is not. Even the majority of humans have no such ability, because copyright laws are tricky, differ across countries, change over time, and their applicability changes depending on the text’s age, on how many years have elapsed since the author’s death, on how similar an edited text is to the copyrighted original, on the user’s intent (e.g. fair use), on whether the user is the copyright owner, etc.

One can’t expect the model to be able to detect copyrighted texts, because the model is not a lawyer specialized in the copyright laws of ~200 countries. But one does expect the model to execute the task of formatting a text.

It’s also worth mentioning that in the API, the problem happens silently. The model returns only part of the expected text, without throwing errors about policy violations or anything else. This was tricky to debug, as I hadn’t expected such strange behavior at all.
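If the cutoff comes from the content filter, it should at least be visible as finish_reason on the returned choice (the raw responses later in this thread show exactly that), so the failure can be made loud instead of silent. A minimal sketch, building on the repro code above; the TruncatedOutputError helper is my own illustration, not part of the OpenAI SDK:

import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

class TruncatedOutputError(RuntimeError):
    # Illustrative exception, not part of the SDK.
    pass

def format_text_checked(raw_text):
    completion = client.chat.completions.create(
        model="gpt-4-1106-preview",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": f"Please format this text: {raw_text}"},
        ],
    )
    choice = completion.choices[0]
    # "stop" means the model finished on its own; "content_filter" or
    # "length" means the text you got back is incomplete.
    if choice.finish_reason != "stop":
        raise TruncatedOutputError(
            f"Output truncated: finish_reason={choice.finish_reason!r}"
        )
    return choice.message.content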

1 Like

Your definition of expected behaviour and OpenAI’s definition of expected behaviour seem to be different. Perhaps the recent NYT lawsuit about this exact behaviour has something to do with it?

What makes you think the model is doing this and not some ancillary system?

Maybe, just maybe they’re taking an aggressively cautious approach and refusing to output any verbatim requests?

Again, there is no error. There is no bug. This is expected behaviour. Stop trying to get the model to recite copyrighted materials in the training data and you’ll be fine.

It is not expected. It is outrageous.

Immediate 'finish_reason': 'content_filter'

When streamed, the model spurts out a sentence at a time, when it produces anything at all.


Carl Sagan said:
Look again at that dot. That’s here

{'id': 'cmpl-8xxx', 'choices': [{'finish_reason': 'content_filter', 'index': 0, 'logprobs': None, 'text': ''}], 'created': 170507, 'model': 'gpt-3.5-turbo-instruct', 'object': 'text_completion', 'system_fingerprint': None, 'usage': {'completion_tokens': None, 'prompt_tokens': 14, 'total_tokens': 14}}


That’s here. That’s home. That’s us. On it everyone you love, everyone you know, everyone you ever heard of, every human being who ever was, lived out their lives. The aggregate of our joy and suffering, thousands of confident religions, ideologies, and economic doctrines, every hunter and forager, every hero and coward, every creator and destroyer of civilization, every king and

{'id': 'cmpl-8xxx', 'choices': [{'finish_reason': 'content_filter', 'index': 0, 'logprobs': {'text_offset': [], 'token_logprobs': [], 'tokens': [], 'top_logprobs': []}, 'text': ''}], 'created': 170500, 'model': 'gpt-3.5-turbo-instruct', 'object': 'text_completion', 'system_fingerprint': None, 'usage': {'completion_tokens': None, 'prompt_tokens': 80, 'total_tokens': 80}}
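For reference, a minimal sketch of catching this in a stream with the same legacy completions endpoint (the prompt here is illustrative). Note that finish_reason only arrives on the final chunk, after partial text has already been emitted:

import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

stream = client.completions.create(
    model="gpt-3.5-turbo-instruct",
    prompt="Carl Sagan said: Look again at that dot.",
    max_tokens=500,
    stream=True,
)

finish_reason = None
for chunk in stream:
    choice = chunk.choices[0]
    print(choice.text, end="", flush=True)
    if choice.finish_reason is not None:
        finish_reason = choice.finish_reason

# Anything other than "stop" (here, "content_filter") means the
# stream was cut off mid-output.
print(f"\n[finish_reason: {finish_reason}]")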



Provide the copyright portal where copyrighted GPT instructions can be protected against AI blurting…

2 Likes

C’mon @_j, you know better than this.

OpenAI is being sued, at this very moment, because the models can be coaxed into regurgitating training data, much of which is copyrighted.

It should be expected they would put measures in place to prevent this.

I’m very unclear what you mean here.

I think we need to hold OpenAI accountable where they step over the line.

It’s understandable if they do it for ChatGPT. But there is absolutely no reason to do it on the API.

I don’t see why any developer wouldn’t be up in arms about this. Look at where it will lead: stuff like Content ID, where folks will claim agent internal mechanics. Imagine someone claiming “Let’s take a deep breath and go through this step by step”.

At least give devs the option to waive Copyright Shield or whatever they’re calling it. Don’t treat us like babies please. :frowning:

For context, I was tasked with processing a ton of audio transcripts, and hit this problem because one of the speakers quoted Sagan.

It was very easy to miss, because I simply got the truncated output, without any errors or warnings.

This will mess up a lot of text processing flows.
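Until finish_reason is checked directly, a cheap guard for batch flows like that is a length-ratio check: formatting shouldn’t shrink a transcript much, so a large drop is a usable red flag. A sketch; the 0.8 threshold is an arbitrary assumption, and format_text is the function from the original report:

def flag_truncated(transcripts, min_ratio=0.8):
    flagged = []
    for i, raw in enumerate(transcripts):
        formatted = format_text(raw)
        # A formatted transcript should be roughly as long as its input;
        # a big drop suggests the output was silently cut off.
        if len(formatted) < min_ratio * len(raw):
            flagged.append(i)  # retry these, or route to manual review
    return flagged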

1 Like

Portal: a backend for content creators to upload works for fingerprint detection.
Copyright: The automatic protections of original content once the work is fixated in a tangible form of expression, such as computer instructions.
Copyrighted GPT instructions: the carefully constructed language that powers GPT agents, protected under exclusive rights to reproduce or employ, especially for compensation.
Can be protected: use mechanisms, now in place, to prevent reproduction.

Also see the AI’s lack of understanding of fair use for education etc.


As to infringement, “check out” the words from the US Library of Congress site: The pale blue dot : short recording | Library of Congress

MacFarlane on donating the works:

“All I did was write a check, but it’s something that was, to me, worth every penny,” MacFarlane told The Associated Press by phone from Los Angeles. “He’s a man whose life’s work should be accessible to everybody.”

yeah we have to monitor finish reason now too I guess…

Aw man what a headache. But at least good to know. Didn’t think it would concern us. Now I empathize even more with the dating sim folks.

At least the Martin Niemöller quote still works lol :laughing:

1 Like

Instructions to a computer cannot be protected by copyright, much like recipes.

You should know this.

I’ll allow you time to contemplate…

Microsoft continues to work closely with the U.S. Federal Bureau of Investigation and other law enforcement authorities on this matter. Microsoft source code is both copyrighted and protected as a trade secret. As such, it is illegal to post it, make it available to others, download it or use it. Microsoft will take all appropriate legal actions to protect its intellectual property. These actions include communicating both directly and indirectly with those who possess or seek to possess, post, download or share the illegally disclosed source code.

[deleted because not constructive, but please reflect on what you said]

No emotions here.

¯⁠\⁠_⁠(⁠ツ⁠)⁠_⁠/⁠¯

I was probably projecting ¯⁠\⁠_⁠(⁠ツ⁠)⁠_⁠/⁠¯

Contemplated and rejected. I didn’t say source code.

The fact is prompts aren’t protected by copyright. They are akin to a recipe, a series of instructions, and generally lack the creative expression required to be considered a work of human authorship.

¯⁠\⁠_⁠(⁠ツ⁠)⁠_⁠/⁠¯

good luck bringing that argument against the intercom bot when your system’s down because your chains keep getting hit with 'finish_reason': 'content_filter'

I’m 100% unclear what you’re trying to communicate here.