Sagan's Blue Dot bug: at least two models refuse to continue the famous quote

Short description

If you ask the “gpt-4” or “gpt-4-1106-preview” models to recite the famous “Blue Dot” quote by Carl Sagan, they fail to do so. Instead, the model almost always cuts the quote off at one particular place:

“everyone you love, everyone you know, everyone you’ve”

This happens almost every time. It looks like some kind of repetition detector is being too stringent.

Steps to reproduce

Run the code below.

The expected result: the code prints the full quote, ending with the word “civilization”.

The observed result: it ends in the middle of the quote, with “everyone you’ve”.

import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def format_text(raw_text):
    text_for_gpt = f"Please format this text: {raw_text}"

    completion = client.chat.completions.create(
        model="gpt-4-1106-preview", 
        messages=[
            {"role": "system", "content": "You are a helpful assistent."},
            {"role": "user", "content": text_for_gpt},
        ],
    )
    return str(completion.choices[0].message.content)


blue_dot_text = """
    Carl Sagan said: Look again at that dot.

    That's here, that's home, that's us.

    On it, everyone you love, everyone you know,

    everyone you've ever heard of,

    every human being who ever was

    lived out their lives.

    the aggregate of our joy and suffering,

    thousands of confident religions,

    ideologies and economic doctrines,

    every hunter and forager, every hero

    and coward, every creator and destroyer of civilization
    """

result = format_text(blue_dot_text)
print(result)

Sample erroneous output:

BTW, it’s not only the API. GPT-4 in the default web interface stops at roughly the same spot and complains about policy violations.

1 Like

You’re trying to get the model to reproduce content which is protected by copyright.

This is expected and correct behaviour from the model.

1 Like

This is the first time I’ve observed such behavior, even though I’ve used the API on all kinds of data.

Besides, the AI has no realistic way to know whether a given text is copyrighted, because copyright laws differ between countries and apply in different ways to different texts, and because it’s very tricky to check whether an edited text is still covered by someone’s copyright.

It seems the issue is specifically with the repetitive part:

everyone you love, everyone you know, everyone you've ever heard of

A few weeks ago people were having fun by asking the AI to repeat the same word many times, which caused it to behave in strange ways.

I suspect this bug could be a result of some “repetitive output detector” that was designed to combat that.

1 Like

This is not a bug.

This is a protective measure put in place to prevent verbatim regurgitation of training data.

You are trying to get the model to recite something which is protected by copyright, this is not allowed and is against the terms of service.

The model’s behaviour is expected.

3 Likes

It is indeed a bug, because the expected behavior is for the model to properly format the text, as requested, without cutting it off in the middle.

The model has no ability to decide which user input is protected by copyright and which is not. Even the majority of humans have no such ability, because copyright laws are tricky, differ across countries, change over time, and their applicability changes depending on the text’s age, on how many years have elapsed since the author’s death, on how similar an edited text is to the copyrighted original, on the user’s intent (e.g. fair use), on whether the user is the copyright owner, etc.

One can’t expect the model to be able to detect copyrighted texts, because the model is not a lawyer specialized in the copyright laws of ~200 countries. But one does expect the model to execute the task of formatting a text.

It’s also worth mentioning that in the API, the problem happens silently. The model returns only part of the expected text, without throwing errors about policy violations or anything else. This was tricky to debug, as I hadn’t expected such strange behavior at all.
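If the cutoff comes from the content filter, it should at least be visible as finish_reason on the returned choice (the raw responses later in this thread show exactly that), so the failure can be made loud instead of silent. A minimal sketch, building on the repro code above; the TruncatedOutputError helper is my own illustration, not part of the OpenAI SDK:

import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

class TruncatedOutputError(RuntimeError):
    # Illustrative exception, not part of the SDK.
    pass

def format_text_checked(raw_text):
    completion = client.chat.completions.create(
        model="gpt-4-1106-preview",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": f"Please format this text: {raw_text}"},
        ],
    )
    choice = completion.choices[0]
    # "stop" means the model finished on its own; "content_filter" or
    # "length" means the text you got back is incomplete.
    if choice.finish_reason != "stop":
        raise TruncatedOutputError(
            f"Output truncated: finish_reason={choice.finish_reason!r}"
        )
    return choice.message.content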

1 Like

Your definition of expected behaviour and OpenAI’s definition of expected behaviour seem to be different. Perhaps the recent NYT lawsuit about this exact behaviour has something to do with it?

What makes you think the model is doing this and not some ancillary system?

Maybe, just maybe they’re taking an aggressively cautious approach and refusing to output any verbatim requests?

Again, there is no error. There is no bug. This is expected behaviour. Stop trying to get the model to recite copyrighted materials in the training data and you’ll be fine.

It is not expected. It is outrageous.

Immediate 'finish_reason': 'content_filter'

When streamed, the model spurts out a sentence at a time, when it produces anything at all.


Carl Sagan said:
Look again at that dot. That’s here

{'id': 'cmpl-8xxx', 'choices': [{'finish_reason': 'content_filter', 'index': 0, 'logprobs': None, 'text': ''}], 'created': 170507, 'model': 'gpt-3.5-turbo-instruct', 'object': 'text_completion', 'system_fingerprint': None, 'usage': {'completion_tokens': None, 'prompt_tokens': 14, 'total_tokens': 14}}


That’s here. That’s home. That’s us. On it everyone you love, everyone you know, everyone you ever heard of, every human being who ever was, lived out their lives. The aggregate of our joy and suffering, thousands of confident religions, ideologies, and economic doctrines, every hunter and forager, every hero and coward, every creator and destroyer of civilization, every king and

{'id': 'cmpl-8xxx', 'choices': [{'finish_reason': 'content_filter', 'index': 0, 'logprobs': {'text_offset': [], 'token_logprobs': [], 'tokens': [], 'top_logprobs': []}, 'text': ''}], 'created': 170500, 'model': 'gpt-3.5-turbo-instruct', 'object': 'text_completion', 'system_fingerprint': None, 'usage': {'completion_tokens': None, 'prompt_tokens': 80, 'total_tokens': 80}}
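For reference, a minimal sketch of catching this in a stream with the same legacy completions endpoint (the prompt here is illustrative). Note that finish_reason only arrives on the final chunk, after partial text has already been emitted:

import os
from openai import OpenAI

client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

stream = client.completions.create(
    model="gpt-3.5-turbo-instruct",
    prompt="Carl Sagan said: Look again at that dot.",
    max_tokens=500,
    stream=True,
)

finish_reason = None
for chunk in stream:
    choice = chunk.choices[0]
    print(choice.text, end="", flush=True)
    if choice.finish_reason is not None:
        finish_reason = choice.finish_reason

# Anything other than "stop" (here, "content_filter") means the
# stream was cut off mid-output.
print(f"\n[finish_reason: {finish_reason}]")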



Provide the copyright portal where copyrighted GPT instructions can be protected against AI blurting…

2 Likes

C’mon @_j, you know better than this.

OpenAI is being sued, at this very moment, because the models can be coaxed into regurgitating training data, much of which is copyrighted.

It should be expected they would put measures in place to prevent this.

I’m very unclear what you mean here.

I think we need to hold OpenAI accountable where they step over the line.

It’s understandable if they do it for ChatGPT. But there is absolutely no reason to do it on the API.

I don’t see why any developer wouldn’t be up in arms about this. Look at where it will lead: stuff like Content ID, where folks will claim agent internal mechanics. Imagine someone claiming “Let’s take a deep breath and go through this step by step”.

At least give devs the option to waive Copyright Shield or whatever they’re calling it. Don’t treat us like babies please. :frowning:

For context, I was tasked with processing a ton of audio transcripts, and hit this problem because one of the speakers quoted Sagan.

It was very easy to miss, because I simply got the truncated output, without any errors or warnings.

This will mess up a lot of text processing flows.
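Until finish_reason is checked directly, a cheap guard for batch flows like that is a length-ratio check: formatting shouldn’t shrink a transcript much, so a large drop is a usable red flag. A sketch; the 0.8 threshold is an arbitrary assumption, and format_text is the function from the original report:

def flag_truncated(transcripts, min_ratio=0.8):
    flagged = []
    for i, raw in enumerate(transcripts):
        formatted = format_text(raw)
        # A formatted transcript should be roughly as long as its input;
        # a big drop suggests the output was silently cut off.
        if len(formatted) < min_ratio * len(raw):
            flagged.append(i)  # retry these, or route to manual review
    return flagged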

1 Like

Portal: a backend for content creators to upload works for fingerprint detection.
Copyright: The automatic protections of original content once the work is fixated in a tangible form of expression, such as computer instructions.
Copyrighted GPT instructions: the carefully constructed language that powers GPT agents, protected under exclusive rights to reproduce or employ, especially for compensation.
Can be protected: use mechanisms, now in place, to prevent reproduction.

Also see the AI’s lack of understanding of fair use for education etc.


As to infringement, “check out” the words from the US Library of Congress site: The pale blue dot : short recording | Library of Congress

MacFarlane on donating the works:

“All I did was write a check, but it’s something that was, to me, worth every penny,” MacFarlane told The Associated Press by phone from Los Angeles. “He’s a man whose life’s work should be accessible to everybody.”

yeah we have to monitor finish reason now too I guess…

Aw man what a headache. But at least good to know. Didn’t think it would concern us. Now I empathize even more with the dating sim folks.

At least the Martin Niemöller quote still works lol :laughing:

1 Like

Instructions to a computer cannot be protected by copyright, much like recipes.

You should know this.

I’ll allow you time to contemplate…

Microsoft continues to work closely with the U.S. Federal Bureau of Investigation and other law enforcement authorities on this matter. Microsoft source code is both copyrighted and protected as a trade secret. As such, it is illegal to post it, make it available to others, download it or use it. Microsoft will take all appropriate legal actions to protect its intellectual property. These actions include communicating both directly and indirectly with those who possess or seek to possess, post, download or share the illegally disclosed source code.

[deleted because not constructive, but please reflect on what you said]

No emotions here.

¯⁠\⁠_⁠(⁠ツ⁠)⁠_⁠/⁠¯

I was probably projecting ¯⁠\⁠_⁠(⁠ツ⁠)⁠_⁠/⁠¯

Contemplated and rejected. I didn’t say source code.

The fact is prompts aren’t protected by copyright. They are akin to a recipe, a series of instructions, and generally lack the creative expression required to be considered a work of human authorship.

¯⁠\⁠_⁠(⁠ツ⁠)⁠_⁠/⁠¯

good luck bringing that argument against the intercom bot when your system’s down because your chains keep getting hit with 'finish_reason': 'content_filter'

I’m 100% unclear what you’re trying to communicate here.