ChatGPT 4o-2024-11-20 cuts 3/4 of the output (2024-08-06 returns the full output)

Hello,

I am using Azure OpenAI with the models gpt-4o 2024-08-06 and 2024-11-20.

I have written a small script to compare both models, allowing me to use the same input and configuration for both.

I observed that the 2024-11-20 model drastically shortens the output in terms of tokens (input → output):

• 2024-08-06: ~5963 tokens → ~5595 tokens
• 2024-11-20: ~same input → ~1400 tokens

Additionally, the newer model frequently writes something like: “Text continues with identical edits…” or similar phrasing.

I’m wondering if anyone has tips or suggestions on how to address this issue.
I’d be thankful for any hints!

Best regards,
Tim

More details:

Request message:

Headers: {
    "Content-Type": "application/json"
}
Body: {
    "messages": [
        { "role": "system", "content": "..." },
        { "role": "user", "content": "..." }
    ],
    "max_tokens": 16384
}

Response message:

Body: {
    "choices": [
        {
            "content_filter_results": { [ /*all safe and false */ ] },
            "finish_reason": "stop",
            "index": 0,
            "logprobs": null,
            "message": {
                "content": " \n\n...  \n\n[Text continues with identical edits ensuring tense consistency throughout]  ",
                "refusal": null,
                "role": "assistant"
            }
        }
    ],
    "model": "gpt-4o-2024-11-20",
    "object": "chat.completion",
    "prompt_filter_results": [ /*all safe and false */ ],
    "usage": {
        "completion_tokens": 1398,
        "completion_tokens_details": {
            "accepted_prediction_tokens": 0,
            "audio_tokens": 0,
            "reasoning_tokens": 0,
            "rejected_prediction_tokens": 0
        },
        "prompt_tokens": 5963,
        "prompt_tokens_details": {
            "audio_tokens": 0,
            "cached_tokens": 0
        },
        "total_tokens": 7361
    }
}
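
Roughly, the comparison script does something like this (simplified sketch; the endpoint, key, and deployment names are placeholders):

from openai import AzureOpenAI

# Placeholders, not my real configuration.
client = AzureOpenAI(
    azure_endpoint="https://<resource>.openai.azure.com",
    api_key="<api-key>",
    api_version="2024-08-01-preview",
)

messages = [
    {"role": "system", "content": "..."},  # identical system prompt for both runs
    {"role": "user", "content": "..."},    # identical user text for both runs
]

for deployment in ("gpt-4o-2024-08-06", "gpt-4o-2024-11-20"):  # Azure deployment names
    resp = client.chat.completions.create(
        model=deployment,
        messages=messages,
        max_tokens=16384,
    )
    print(
        deployment,
        "prompt:", resp.usage.prompt_tokens,
        "completion:", resp.usage.completion_tokens,
        "finish_reason:", resp.choices[0].finish_reason,
    )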

Welcome to the community!

This has historically always been an issue. It went from davinci being able to blow right past the theoretical context length limits under certain circumstances, to output being artificially restricted, to the models being trained to limit their outputs, to the limit being increased again, etc. etc.

I would suggest two things:

  1. Try to break down your tasks into pieces that require responses no longer than a couple hundred tokens (see the rough sketch after this list). In general, the longer the response, the more stability goes out the window.
  2. Benchmark for your use case. Typically (as far as I’m aware), newer models within a given generation (GPT-4, 4-turbo, 4o) tend to get worse as the version number increases. Sometimes it’s just not worth upgrading until a major version comes along, unless there’s a specific feature you absolutely need.
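
Very roughly, point 1 could look something like this if you’re calling the chat completions API from Python (a sketch; paragraph-sized chunks and the prompt are just placeholders):

def fix_in_chunks(client, deployment, text, system_prompt):
    # Paragraph-sized chunks keep each individual response short.
    chunks = [p for p in text.split("\n\n") if p.strip()]
    fixed = []
    for chunk in chunks:
        resp = client.chat.completions.create(
            model=deployment,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": chunk},
            ],
            max_tokens=1024,  # keep each response well below the point where stability drops
        )
        fixed.append(resp.choices[0].message.content)
    return "\n\n".join(fixed)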

I hope this helps somewhat.

What did you hope to get out of the 11 model over the 08 model?

Hello,

Thank you for your quick response.

I am using the model to fix tense issues in a longer text. I was very excited about the announcement of the 2024-11-20 model and its enhanced creative writing capabilities, so I thought it would be a great fit for this task.

I tried breaking the text into smaller chunks, but the dominant tense within a chunk often leads to misleading results.

Also, performing a preliminary analysis of the full text first and then providing the model with individual chunks didn’t yield the expected results. At times, the model became overly strict or over-corrected without considering the full context.
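
For reference, the preliminary-analysis attempt looked roughly like this (simplified; the prompts here are placeholders, not the exact ones I used):

def fix_tense_two_pass(client, deployment, full_text):
    # Pass 1: analyze the whole text once to decide on the target tense.
    analysis = client.chat.completions.create(
        model=deployment,
        messages=[
            {"role": "system", "content": "Determine the dominant tense of the text and how it should be normalized."},
            {"role": "user", "content": full_text},
        ],
        max_tokens=256,
    ).choices[0].message.content

    # Pass 2: rewrite each chunk, passing the global analysis along as context.
    fixed = []
    for chunk in full_text.split("\n\n"):
        resp = client.chat.completions.create(
            model=deployment,
            messages=[
                {"role": "system", "content": "Fix tense issues in the given chunk. Global analysis of the full text:\n" + analysis},
                {"role": "user", "content": chunk},
            ],
            max_tokens=2048,
        )
        fixed.append(resp.choices[0].message.content)
    return "\n\n".join(fixed)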

Best regards,
Tim

Yeah that’s some goldfishification I’ve also noticed with 4o in general :confused:

While these models have a large nominal context window, they can only deal with a handful of concepts at once, depending on how complicated and spatially diffuse they are. So depending on what you’re trying to do, you might really need to recursively break concepts down in order to fit them into workable attention.

I hope someone who’s into generating long AI fiction can pop in here and help you out. Good luck! :slight_smile:

This is an issue that seems very specific to gpt-4o-2024-11-20, is difficult (at best) to mitigate, and makes this version of the model completely unusable for many long-form use cases. I’m rather shocked that this has gotten so little attention.

I first noticed it in an application I created to generate short fiction novels. When the new model version was released, its enhanced creativity was specifically noted, so I was obviously interested in trying it out in this context. Unfortunately, the first step in the process is to generate a story outline of n chapters in a specific format, and gpt-4o-2024-11-20 will never generate the full outline if n is any higher than 5-6. It stops generating the outline after chapter 5 or 6, usually between 1800-2000 tokens (the fact that it always seems to find a good stopping point is noteworthy, I think), and claims space limitations. Here are a few examples:

  • Due to space limitations, the outline is incomplete. Would you like me to continue with Chapters 6-10?
  • (Continued in the next response…)
  • (Chapters 6–10 to follow in continuation due to space constraints.)

I tried a variety of techniques to get it to ignore its hallucinated space constraints, but the best I achieved was about a 25% success rate. I did that by adding:

Important Note: If you do not complete a full {{ $chapterCount }} chapter outline in a single response, you will fail. If you fail, I will fail. If you complete the full {{ $chapterCount }} chapter outline in a single response, we will be rewarded.

and

Additional Instructions

Under no circumstances should you promise to finish the outline in a future response. You MUST complete the outline in a single response.

But even then, it would fail most of the time and would end its response with, for example, "(Chapters 7–10 continue in a similar format, ensuring the full outline is completed.) ".
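
The closest thing to a workaround I can think of is detecting these continuation markers and re-prompting automatically, roughly like this (a crude sketch; the marker patterns are just heuristics, and this papers over the problem rather than fixing it):

import re

# Heuristic markers for "I stopped early and promised to continue".
CONTINUATION_MARKERS = re.compile(
    r"(continued in the next response|to follow in continuation|due to space (limitations|constraints))",
    re.IGNORECASE,
)

def generate_outline(client, deployment, messages, max_rounds=5):
    outline = ""
    for _ in range(max_rounds):
        resp = client.chat.completions.create(
            model=deployment,
            messages=messages,
            max_tokens=16384,
        )
        part = resp.choices[0].message.content
        outline += part
        if not CONTINUATION_MARKERS.search(part):
            break
        # Feed the partial outline back and ask the model to keep going.
        messages = messages + [
            {"role": "assistant", "content": part},
            {"role": "user", "content": "Continue the outline from where you stopped."},
        ]
    return outline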

I should note that this is not a gpt-4o-2024-11-20 vs gpt-4o-2024-08-06 issue. Rather, it’s gpt-4o-2024-11-20 vs literally every other popular model available. One of the features of this app is the ability to select from a variety of models and model providers. With any previous gpt-4o version, gpt-4o-mini, any gemini-pro version, any gemini-flash version, any Claude version, or even most Mistral models selected, the task completes successfully ~100% of the time. I even tried it with small, local models. Only gpt-4o-2024-11-20 fails consistently. It’s really bizarre…


I want gpt-4o-2024-11-20 to output JSON-structured content, which works fine with gpt-4o-2024-08-06 or gpt-4, but it occasionally produces incomplete JSON content, e.g.:

{
    "steps": [
        {
            "index": 1,
            "analyze": "为了去除方程中的分母,我们需要找到分母4和6的最小公倍数。最小公倍数是两个或多个整数共有的倍数中最小的一个。",
            "formulas": [
                {
                    "formula": "4和6的最小公倍数是12",
                    "WolframFormula": "LCM[4, 6]",
                    "result": "12"
                }
            ],
            "result": "最小公倍数为12"
        },
        {
            "index": 2,
            "analyze": "通过乘以分母的最小公倍数,可以消除方程中的分母。因此,原方程两边乘以12,可以得到:12  \\times  [\\frac{x-1}{4} - 1] = 12  \\times  \\frac{3x+1}{6}。",
            "formulas": [
                {
                    "formula": "12  \\times  [\\frac{x-1}{4} - 1] = 12  \\times  \\frac{3x+1}{6}",
                    "WolframFormula": "",
                    "result": ""
                }
            ],
            "result": null
        },
        {
            "index": 3,
            "analyze": "展开并化简方程得:3(\\frac{x-1}{1}) - 12 = 2(\\frac{3x+1}{1})。继续化简方程,得到:3x - 3 - 12 = 6x + 2。",
            "formulas": [
                {
                    "formula": "3(\\frac{x-1}{1}) - 12 = 2(\\frac{3x+1}{1}",
                    "WolframFormula": "",
                    "result": ""
                },
                {
                    "formula": "-12调整poly="
                    } NULL

This is completely wrong: the content cannot be parsed with “json.load”, and it happens with non-negligible frequency.
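
In case it helps, a defensive pattern is to request JSON mode and retry when parsing fails (a sketch assuming the openai Python SDK; I am not sure JSON mode prevents this particular truncation, hence the retry):

import json

def get_steps(client, deployment, messages, retries=3):
    for _ in range(retries):
        resp = client.chat.completions.create(
            model=deployment,
            messages=messages,  # the prompt must mention "JSON" for json_object mode
            response_format={"type": "json_object"},
            max_tokens=4096,
        )
        try:
            return json.loads(resp.choices[0].message.content)
        except json.JSONDecodeError:
            continue  # incomplete/invalid JSON, try again
    raise RuntimeError("model kept returning invalid JSON")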

It looks to me as if the context length of the model has been reduced a lot. It kind of feels like every request is shortened and then sent to the model.
So it “forgets”, or rather it no longer passes the previous messages to the model, only a shortened version of them.
This reduces the capabilities of the model a lot.
Is there a way to turn off this “auto summarization” behavior? I would rather have it run into a wall and say “the overall conversation hit the max token limit” so I can do the summarization manually.

And why is it that the model always tries to lecture me instead of doing what it is told? It feels more and more like a learning tool for beginners instead of a tool that you can give tasks to.

Is it because it was trained on academic data? Maybe, just maybe that’s wrong…

Actual work data, where results are produced in group discussions and especially through trial and error, is a lot better than summaries of academic papers, where the whole reasoning part (the mistakes made on the way to the result) is not included and where 30% of the papers are straight garbage, produced only because something had to be published…

I mean, reasoning is more like “we don’t touch the hot stove because it hurts” and not like “we don’t touch the hot stove because paper xy says we should avoid the hot stove”.