GPT-4o 2024-11-20 cuts 3/4 of the output (2024-08-06 returns the full output)

Hello,

I am using Azure OpenAI with the models gpt-4o 2024-08-06 and 2024-11-20.

I have written a small script to compare both models, allowing me to use the same input and configuration for both.
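
For illustration, here is a minimal sketch of that comparison script against the Azure OpenAI REST API. The endpoint, key, deployment names, and api-version below are placeholders rather than my actual values, and the prompts are elided:

# Minimal comparison sketch: same input and configuration, two deployments.
import requests

AZURE_ENDPOINT = "https://<resource>.openai.azure.com"    # placeholder
API_KEY = "<api-key>"                                     # placeholder
API_VERSION = "2024-10-21"                                # placeholder api-version
DEPLOYMENTS = ["gpt-4o-2024-08-06", "gpt-4o-2024-11-20"]  # hypothetical deployment names

SYSTEM_PROMPT = "..."  # same system prompt for both runs
USER_PROMPT = "..."    # the long input text

def run(deployment):
    url = (f"{AZURE_ENDPOINT}/openai/deployments/{deployment}"
           f"/chat/completions?api-version={API_VERSION}")
    body = {
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": USER_PROMPT},
        ],
        "max_tokens": 16384,
    }
    headers = {"Content-Type": "application/json", "api-key": API_KEY}
    resp = requests.post(url, headers=headers, json=body, timeout=300)
    resp.raise_for_status()
    data = resp.json()
    return data["choices"][0]["message"]["content"], data["usage"]

for dep in DEPLOYMENTS:
    text, usage = run(dep)
    print(dep, usage["prompt_tokens"], "->", usage["completion_tokens"])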

I observed that the 2024-11-20 model drastically shortens the output in terms of tokens (input → output):

• 2024-08-06: ~5963 tokens → ~5595 tokens
• 2024-11-20: same input (~5963 tokens) → ~1400 tokens

Additionally, the newer model frequently writes something like: “Text continues with identical edits…” or similar phrasing.

I’m wondering if anyone has tips or suggestions on how to address this issue.
I’d be grateful for any pointers!

Best regards,
Tim

More details:

Request message:

Headers: {
    "Content-Type": "application/json"
}
Body: {
    "messages": [
        { "role": "system", "content": "..." },
        { "role": "user", "content": "..." }
    ],
    "max_tokens": 16384
}

Response message:

Body: {
    "choices": [
        {
            "content_filter_results": { [ /*all safe and false */ ] },
            "finish_reason": "stop",
            "index": 0,
            "logprobs": null,
            "message": {
                "content": " \n\n...  \n\n[Text continues with identical edits ensuring tense consistency throughout]  ",
                "refusal": null,
                "role": "assistant"
            }
        }
    ],
    "model": "gpt-4o-2024-11-20",
    "object": "chat.completion",
    "prompt_filter_results": [ /*all safe and false */ ],
    "usage": {
        "completion_tokens": 1398,
        "completion_tokens_details": {
            "accepted_prediction_tokens": 0,
            "audio_tokens": 0,
            "reasoning_tokens": 0,
            "rejected_prediction_tokens": 0
        },
        "prompt_tokens": 5963,
        "prompt_tokens_details": {
            "audio_tokens": 0,
            "cached_tokens": 0
        },
        "total_tokens": 7361
    }
}

Welcome to the community!

Output length has historically always been an issue. It went from davinci being able to blow right past the theoretical context length limits under certain circumstances, to the output being artificially restricted, to the models being trained to limit their outputs, to the limit being raised again, and so on.

I would suggest two things:

  1. Try to break your task down into pieces that require responses of no more than a couple hundred tokens (see the sketch after this list). In general, the longer the response, the more stability goes out the window.
  2. Benchmark for your use case. Typically (as far as I’m aware), newer snapshots within a given generation (GPT-4, 4-turbo, 4o) tend to get worse as the version date increases. Sometimes it’s just not worth upgrading until a major version comes along, unless there’s a specific feature you absolutely need.
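
For (1), something along these lines is what I mean. A rough sketch only: send(system, user) stands in for however you already call the chat completions endpoint, and the prompt wording is just an example.

# Rough sketch: keep each response short by rewriting the text in small pieces.
def split_into_chunks(text, max_chars=1500):
    """Group paragraphs into chunks of roughly a few hundred tokens."""
    chunks, current = [], ""
    for para in text.split("\n\n"):
        if current and len(current) + len(para) > max_chars:
            chunks.append(current.strip())
            current = ""
        current += para + "\n\n"
    if current.strip():
        chunks.append(current.strip())
    return chunks

def fix_tense(text, send):
    """send(system, user) is whatever wrapper you already use for chat completions."""
    system = ("Rewrite the passage in a consistent tense. "
              "Return the full rewritten passage and nothing else.")
    return "\n\n".join(send(system, chunk) for chunk in split_into_chunks(text))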

I hope this helps somewhat.

What did you hope to get out of the 11-20 model over the 08-06 model?

Hello,

Thank you for your quick response.

I am using the model to fix tense issues in a longer text. I was very excited about the announcement of the 2024-11-20 model and its enhanced creative writing capabilities, so I thought it would be a great fit for this task.

I tried breaking the text into smaller chunks, but the dominant tense within a chunk often leads to misleading results.

Performing a preliminary analysis of the full text and then providing the model with individual chunks also didn’t yield the expected results. At times, the model became overly strict or over-optimized a chunk without considering the full context.
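
For reference, that two-pass attempt looked roughly like this. It is a simplified sketch: send(system, user) stands in for my request code, and split_into_chunks is a chunking helper like the one in your sketch, passed in here as a parameter.

# Simplified sketch of the two-pass attempt.
def two_pass_fix(full_text, send, split_into_chunks):
    # Pass 1: analyse the whole text once to pin down the target tense.
    analysis = send(
        "Describe the intended narrative tense of this text in one or two sentences.",
        full_text,
    )
    # Pass 2: rewrite each chunk, handing the model the global analysis as context.
    fixed = []
    for chunk in split_into_chunks(full_text):
        fixed.append(send(
            "Rewrite the passage so its tense matches this analysis of the full "
            "document: " + analysis + " Return only the rewritten passage.",
            chunk,
        ))
    return "\n\n".join(fixed)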

Best regards,
Tim

Yeah that’s some goldfishification I’ve also noticed with 4o in general :confused:

While these models have a large nominal context window, they can only deal with a handful of concepts at once, depending on how complicated and spatially diffuse they are. So depending on what you’re trying to do, you might really need to recursively break concepts down in order to fit them into workable attention.

I hope someone who’s into generating long AI fiction can pop in here and help you out. Good luck! :slight_smile: