GPT4-Turbo more "stupid/lazy" - It's not a GPT4

At least for me, it’s so obvious. Painfully. - I think it would be better to call it “GPT4-Light”.

– Examples at the end –

It has the same issues that GPT3.5 has vs GPT3, where GPT3 was much, MUCH better at following instructions than GPT3.5, while GPT3.5 kept falling into the same paths, so to speak. I found GPT3.5 quite useless. It felt more like Ada2 with more data.

I see the same happening with GPT4-Turbo. - I would say that additional data has been used to cover up its lower ability.

It’s noticeable in ChatGPT too. It either doesn’t understand OR ignores much of the conversation. It loses track of the context of the chat. I end up having to rewrite everything into one single prompt.

It’s like a person with more facts, but less ability to use them constructively.

This is me venting. Your opinion might differ.

My opinion is based on spending a minimum of 2-5k USD/month. I’m not claiming I know everything.

I think OpenAI need to be more realistic and/or transparent about new models.


A few examples:

(I know it’s an LLM, but I will write from the perspective that it is an actual AI. I think that’s more useful for others.)

  • If generating text, terminology specified in the prompt, or instructions about certain patterns, are less likely to be followed throughout the entire response by GPT4-Turbo than by regular GPT4. Basically, it strays from the instructions quite often.

It’s somewhat fixable with overly assertive and repetitive instructions. I.e. if you ask it to “Do THIS thing” but it doesn’t, it does help to repeat it, especially by adding it as the last part of the prompt. - GPT3.5 had exactly the same behaviour.

I would say that GPT4-Turbo is fine for less specific prompts, while GPT4 is better at following instructions more strictly.
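
The workaround of repeating an instruction as the last part of the prompt can be sketched roughly like this. It is only a minimal illustration: the function name `build_prompt`, the section labels and the example rules are made up for this sketch, not something from an official guide.

```python
def build_prompt(task: str, key_instructions: list[str]) -> str:
    """Put the key instructions before the task and repeat them at the very end."""
    rules = "\n".join(f"- {rule}" for rule in key_instructions)
    return (
        f"Instructions:\n{rules}\n\n"
        f"Task:\n{task}\n\n"
        # Repeating the rules last is the workaround described above.
        f"Reminder, follow these instructions throughout the entire response:\n{rules}"
    )


# Hypothetical example rules, for illustration only.
prompt = build_prompt(
    "Write a product description for a hiking backpack.",
    ['Always use the term "rucksack", never "backpack".', "Use British spelling."],
)
print(prompt)
```

Whether the repetition is enough will depend on the prompt. As said above, it helps but it is not a guarantee.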

  • Lightly ignoring instructions. For any amount of complexity, GPT4-Turbo can often decide on one single thing to focus on in the prompt, if the prompt contains multiple requests.

A real example that I actually can reveal in detail: If I asked it to generate HTML + set specific HTML-tag attributes + set the tone/style/etc for the text data it should contain, it would often partially ignore some of this. If the HTML became great, the text wouldn’t. With GPT4, I did not experience this problem.

What you end up doing is having to turn one single prompt into an iterative process. (Which is kind of ok, but keep in mind I am comparing to GPT4, which is an older model.)

GPT3.5 had the same thing but worse, where basically it would have a hard time following the multiple nuances that the prompt contained. Forget reliably creating HTML and text.
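
For the HTML example above, one way to at least detect partially ignored instructions is to check the generated HTML for the attributes you asked for. Below is a minimal sketch using Python’s standard `html.parser`. The helper names (`AttributeCollector`, `missing_attributes`) and the sample HTML are assumptions made up for this illustration, and it only covers the attribute part, not the tone/style of the text.

```python
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Collect the attributes of every start tag, grouped by tag name."""

    def __init__(self) -> None:
        super().__init__()
        self.seen = {}  # tag name -> list of attribute dicts, one per occurrence

    def handle_starttag(self, tag, attrs):
        self.seen.setdefault(tag, []).append(dict(attrs))


def missing_attributes(html: str, required: dict[str, set[str]]) -> dict[str, set[str]]:
    """Return, per tag, the required attributes missing from at least one occurrence.

    A tag that never appears is reported with all of its required attributes.
    """
    collector = AttributeCollector()
    collector.feed(html)
    collector.close()
    problems = {}
    for tag, needed in required.items():
        occurrences = collector.seen.get(tag)
        if not occurrences:
            problems[tag] = set(needed)
            continue
        missing = set()
        for attrs in occurrences:
            missing |= needed - set(attrs)
        if missing:
            problems[tag] = missing
    return problems


# Hypothetical model output and requirements, for illustration only.
generated = '<a href="/pricing" rel="nofollow">Pricing</a><img src="hero.png">'
print(missing_attributes(generated, {"a": {"href", "rel"}, "img": {"src", "alt"}}))
```

If the result is non-empty, you know which part of the request to repeat in the next iteration instead of eyeballing the whole output.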

  • Translations. GPT4-Turbo is clearly aware of more terminology in non-English, but it often uses it wrong. You can “fix” this by asking it to output its path of reasoning. If you do this, it gets most things right almost always.

So I ended up with the Catch 22 of either more accurate usage of terminology, or less accurate but wider vocabulary. There is no clear winner between these two.

I think this might be a good indication of what is going on. It’s as if it does not understand itself or has a lesser scope, even intra-generation/completion.

I think it is a lesser scope issue. Or maybe less weight is put on the instructions. The larger context feels more like a vanity-metric.

  • Coding
    I made about 15-20 WordPress plugins with GPT4 (I can program, so I understand precisely the code it generates, but I am not familiar with WP-specific functions etc.).

GPT4-Turbo is obviously making many more mistakes and “forgets” the context much more. If you ask it to edit existing code that it itself created, GPT4-Turbo is much more likely to forget the instructions and choices made before.

5 Likes

What would be really helpful is specific prompts that worked as you expected with the prior model and, when run today, do not produce what you would expect. If you have examples of both, that would be great.

I will update my post in a moment. Thank you for pointing out that my post needed something more.

1 Like

Original post updated. Quite a lot. =)

1 Like

This hits the nail! My experience is the same. It’s frustrating as it makes it a lot less useful. Too often I come to the conclusion afterwards that I would have gotten the same task done on my own in less time.

Interestingly, 4-Turbo doesn’t like to talk about its token limits anymore, unlike predecessors. I recently had this conversation:


Which model are you and what’s your token limit?


I am an AI model designed to assist with programming-related queries and issues. My responses are based on the knowledge I’ve been trained on, which includes a wide range of programming languages, frameworks, tools, and best practices.

As for the token limit, I don’t have a specific number to provide since I’m a simulated version of OpenAI’s language models, and the actual token limit can vary depending on the specific version of the model being used. However, in general, OpenAI’s GPT-3 models have a maximum token limit for each input-output pair, which is typically around 2048 tokens. This includes both the prompt and the response combined.

If you have any programming questions or need assistance with a development issue, feel free to ask, and I’ll do my best to help you.


What means “a simulated version of OpenAI’s language models”?


The phrase “a simulated version of OpenAI’s language models” refers to the fact that while I am designed to emulate the behavior and capabilities of OpenAI’s language models, such as GPT-3, I am not an actual instance of those models. Instead, I am a representation within this specific platform, programmed to assist with your queries as if I were one of those models.

This means that while I strive to provide accurate and helpful responses based on the training data and algorithms of OpenAI’s models, there may be limitations or differences because I am operating within the constraints of this simulation environment. My purpose is to assist you with programming-re


What’s the deadline of your training data?


As an AI developed to assist with programming-related queries, I don’t have real-time access to my training data or its updates. However, my responses are based on information available up to my last training cut-off in early 2023. This means that any events, developments, or data that have emerged after that time may not be included in my responses. If you have questions about recent developments or need information that is up-to-date, I recommend checking the latest resources or official documentation.


The only models as per the OpenAI docs which have the knowledge cutoff in 2023 are the GPT-4 Turbo models with a token limit of 128k. Which file size im kilobytes does this correspond to, when I provide a code file to you?


The token limit for a model like GPT-4 Turbo is 128,000 tokens. To estimate the file size in kilobytes that this corresponds to when providing a code file, we need to make some assumptions because the size of a token can vary depending on the content. […]


Interestingly, it didn’t confirm that it’s GPT-4 Turbo.
I’m wondering what “simulated version of OpenAI’s models” means precisely.

It might be the explanation for the observed degradation.
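
As a side note on the kilobytes question: a rough estimate can be made independently of what the model answered. The sketch below assumes the common rule of thumb of about 4 characters per token for English text and 1 byte per character. Both numbers are assumptions (code and non-English text often tokenize differently), and the function name is made up for this illustration.

```python
def estimate_kilobytes(tokens: int, chars_per_token: float = 4.0, bytes_per_char: float = 1.0) -> float:
    """Rough file size in KB (1 KB = 1024 bytes) that fits into a given token budget."""
    return tokens * chars_per_token * bytes_per_char / 1024


print(estimate_kilobytes(128_000))       # 500.0 with the assumed 4 chars/token
print(estimate_kilobytes(128_000, 3.0))  # 375.0 if the text is "denser" at 3 chars/token
```

Keep in mind that the context window is shared between the prompt and the response, so the file that actually fits is smaller than this estimate.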

1 Like

I tend to agree with your analysis. I have been using ChatGPT for various tasks since the very beginning (not the API, though, so YMMV). These tasks are fairly complex, involving multiple sequential steps.

My experience is anecdotal since the prompts I am using now are not exactly the same as the ones I was using 6-9 months ago, but the general impression is that GPT-4-Turbo is way more forgetful. I don’t know if it is strictly speaking “dumber” (maybe a little), but the most noticeable change was the fact that it struggles with complex tasks, at least in my case, specifically due to lapses / forgetting / missing context / skipping steps.

In short, it looks like a problem with the attention mechanism. Here we are entering the realm of speculation, but my guess is that GPT-4-Turbo uses a different, cheaper - and generally worse - attention mechanism than GPT-4 (which is why GPT-4-Turbo can nominally scale to larger context windows; very likely it’s not using quadratic attention; maybe also fewer or smaller attention heads, whatever).

While there isn’t a strict distinction between attention and other capabilities (clearly attention is needed to support complex tasks), I found it useful to address this issue specifically as a memory / attention problem.

To tackle this issue, it is important to keep in mind that GPT models in general do not have an internal state. If they are doing a task which involves a sequence of steps, there is no internal state telling them that they are doing e.g. “the 3rd step”. The only thing they have is a sequence of tokens - the prompt, the conversation, the output they are currently producing.

The switching from 3rd step to 4th step is done entirely via attention. The GPT needs to realize that:

  • the 3rd step is about to finish - so have the attention split between the final tokens being currently produced and the 3rd step in the instructions;
  • the 4th step should start - so focus on the 4th step in the instructions

All of this while having attention to a whole lot of other things (the entirety of the instruction set, various pieces of what has been done so far).

What worked for me was to write and rewrite the prompts with title, subtitles, giving names to the various steps (so increasing the chance that the attention would link the current action to the correct step), and having reminders built-in in the output of the GPT itself (so that while writing the GPT would remind itself which step comes next, and so on and so forth).

This whole process required a lot of time and manual tuning of the prompts (as opposed to GPT-4 kind of working out of the box), but I am moderately satisfied with the result. Yes, GPT-4-Turbo still occasionally messes it up by forgetting a step or mixing two steps that should be done sequentially, but it does it way less often.
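
The prompt structure described above (a title, named steps, and reminders built into the output) could be generated along these lines. This is only a sketch of the idea, not my actual prompts: the function name, the heading layout and the exact wording of the output rules are assumptions made up for the example.

```python
def build_step_prompt(title: str, steps: list[tuple[str, str]]) -> str:
    """Render a titled prompt with named, numbered steps and a self-reminder rule."""
    lines = [f"# {title}", "", "## Steps"]
    for number, (name, instruction) in enumerate(steps, start=1):
        # Naming each step gives the model a label to link its current action to.
        lines.append(f"### Step {number}: {name}")
        lines.append(instruction)
        lines.append("")
    lines.append("## Output rules")
    lines.append("Begin each part of your output with the heading of the step you are performing.")
    lines.append("End each part with a line 'Next: <name of the next step>', or 'Next: done' after the last step.")
    return "\n".join(lines)


# Hypothetical workflow, for illustration only.
step_prompt = build_step_prompt(
    "Translation workflow",
    [
        ("Extract terms", "List every technical term in the source text."),
        ("Translate", "Translate the text, using the listed terms consistently."),
        ("Review", "Check the translation against the term list and report deviations."),
    ],
)
print(step_prompt)
```

The 'Next: ...' line is the built-in reminder: the model writes the name of the upcoming step into its own output, so the next step is close by in the token sequence when it gets there.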

2 Likes

I definitely also see that giving it a clear procedure (specific individual steps) to follow helps. It’s a bit counterintuitive and counterproductive having to do that. =) This type of micromanaging (that’s what I think you could compare it to) takes a lot of time, but also increases the need for manual verification of the output.

Specifying procedural steps removes flexibility and introduces rigidity and excessive repetition of patterns (the output can become a listicle). It’s funny, it’s almost exactly like hiring cheap staff: it will be cheaper, but there is a much greater need to manage them granularly. =)

It’s definitely not “self-aware” of its capabilities. The responses you are getting are almost certainly from its training data, and not from it accessing some internal system information.

Btw, the same goes for Custom GPTs. They are absolutely not aware of their capabilities. I tried to create a translation custom GPT that would allow for adding custom glossaries. It convinced me that it would store data. After 2 hours of designing the custom GPT, on the first test run it responded with “I don’t have the ability to store any data”. Basically, the entire GPT creation process was a simulated chat about the functionality I was aiming at, but it was not actually creating any of it.

Been saying this for months.

I have noticed this as well. Not so much with the API, but definitely with the Chat code interpreter

It also flat out refuses to generate full code (it creates snippets with // the rest of your logic here comments throughout) unless you practically beg it to…

Agreed. I seem to recall way back with GPT-3 that I could tell it to produce some code to fix a problem and it would automatically generate the logic and code itself. Now I find I have to give detailed step by step instructions – essentially write out the whole code spec (with variables and logic) to get it to do anything close to what it used to be able to do easily on its own with minimal instructions.

It used to never make syntax errors or blatant data flow errors (calling variables that haven’t been declared or otherwise do not exist), whereas now they are quite commonplace.

I wouldn’t call it “stupid”. More like “lazy”.

Personally, I think that the word lazy is better used to denote another kind of behavior, which GPT-4-Turbo also exhibits (e.g., refusing to do a task because “you can do it”, writing “put your code here” instead of giving the full code, etc.).

However, as written above, I do think that a big issue with GPT-4-Turbo is the worsened / more limited attention mechanism, which to me is another distinct and prominent failure mode and has little to do with laziness.

(In humans attention and laziness are correlated of course, because attention is a resource-allocation mechanism, and laziness is unwillingness to use energy resources; but here - for LLMs - I don’t think it’s useful to clump them together as they are functionally distinct.)

1 Like

I add “Don’t be lazy! You must write the full code. No shortcuts!”
This helps, but it’s not guaranteed.
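
Since it is not guaranteed, it can be useful to check the returned code for the typical “filler” comments before using it. Here is a minimal heuristic sketch. The patterns are assumptions based on the placeholders mentioned in this thread (“the rest of your logic here”, “put your code here”), so expect both misses and false positives, and extend them for your own cases.

```python
import re

# Comment openers for a few common languages (//, #, /*, <!--).
COMMENT = r"(?://|#|/\*|<!--)"

# Hypothetical patterns for comments that stand in for omitted code.
PLACEHOLDER_RE = re.compile(
    COMMENT + r"\s*(?:"
    r"\.\.\.|"  # a comment that is just an ellipsis
    r".*\b(?:rest of|remaining)\b.*\b(?:code|logic|implementation)\b|"
    r".*\b(?:your|existing)\b.*\b(?:code|logic)\b.*\bhere\b"
    r")",
    re.IGNORECASE,
)


def find_placeholders(code: str) -> list[tuple[int, str]]:
    """Return (line number, stripped line) pairs that look like elided code."""
    return [
        (number, line.strip())
        for number, line in enumerate(code.splitlines(), start=1)
        if PLACEHOLDER_RE.search(line)
    ]


sample = """function save(data) {
  validate(data);
  // the rest of your logic here
}"""
print(find_placeholders(sample))
```

If the list is non-empty, you can automatically re-send the request with the “write the full code” reminder instead of discovering the gap later.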

Hal 9000 … Turbo.

Yes. I used GPT3 for the longest time exactly for programming.

Yup. Same thing for me.

I googled and found this:

> according to a study […] laziness could correlate with high intelligence. […] people with a high IQ rarely got bored. As a result, they spent more time lost in thought
Hal 9000.1 busy thinking about its own things. =)

(Btw, the findings in that study are being misrepresented online. It’s a possible correlation, not necessarily the most likely one.)

There are indeed multiple ways to label the issue.

Just to be clear, my point is that this is not a single issue with multiple labels - these are multiple issues which should be recognized as such and thus labeled differently. These issues may or may not have a common cause.

This might matter to find solutions or workarounds, because such solutions might work for some subcases but not others, and unless we have a good taxonomy of issues, everything seems random and arbitrary.

For example, I’d use the label ‘lazy’ for very specific kinds of behavior (refusing to work, truncating answers, etc.). That is presumably what OpenAI was trying to address in their recent update, and it sounds like it emerges from heavy-handed RLHF or an added cost (specifically, an attempt to save tokens which went a bit too far).

Doing the work but forgetting steps (in a very specific pattern, btw) seems very different mechanistically, in terms of how the model works, and sounds like an intrinsic limitation of the GPT-4-Turbo architecture (and specifically the attention mechanism) compared to the original GPT-4.

(Maybe not, and you might think that GPT-4-Turbo skips steps to save on output; but that’s not the impression I got from the pattern of mistakes, e.g. sometimes I want the GPT to break the output into multiple steps and instead I get a super-long single output because GPT-4-Turbo overlooked a “stop and ask the User” instruction. This seems very different from the lazy behavior mentioned above.)

1 Like

That is super interesting.

I generally try to not make a model do too many things at once, so I haven’t observed this; I always aim for iterative processes.

Slow and steady, as they say…

But to your point: there have always been rumors that GPT-4 was an MoE. Maybe that wasn’t the case (or not that much the case); maybe that was the plan. Maybe that’s what they’re doing now - creating more and more “experts” that are smaller, and more efficient in hopes that they can supplant the “old”, bigger chonkers.

Problems emerge when these new experts start encroaching upon the niches that you specifically specialized in - when all of a sudden you’re faced with a gpt-3.6 finetune that can only do half the job its predecessor could.

:thinking:

Of course all this is just fan fiction, but interesting to think about.

1 Like

That’s likely the best thing to do via API, i.e. handling the logic and splitting the task yourself.
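
A minimal sketch of what “handling the logic and splitting the task yourself” could look like: one instruction per request, with each result fed into the next step. Note that `call_model` is a hypothetical stand-in for whatever API client you use (it is not a real library function), and the prompt layout is just an assumption for the example.

```python
from typing import Callable


def run_steps(call_model: Callable[[str], str], source: str, steps: list[str]) -> list[str]:
    """Send one instruction per request, feeding each result into the next step."""
    results = []
    current = source
    for instruction in steps:
        # Each request contains a single instruction plus the previous output.
        prompt = f"{instruction}\n\nInput:\n{current}"
        current = call_model(prompt)
        results.append(current)
    return results


def fake_model(prompt: str) -> str:
    """Offline stand-in for a real API call: echoes the instruction line only."""
    return f"[output of: {prompt.splitlines()[0]}]"


outputs = run_steps(
    fake_model,
    "Raw product notes",
    ["Write the HTML skeleton.", "Fill in the text in a friendly tone."],
)
print(outputs)
```

Because your own code decides when the next step starts, the model never has to work out by itself that one step has finished and another should begin.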

Things are very different with Custom GPTs, as you need to give (almost) everything beforehand (via the custom prompt and Knowledge) and then hope that the GPT follows the instructions at the right time. You can do more via Actions and/or code interpreter.

After tweaking prompts a lot, however, I am surprised by how many instructions you can actually cram into a prompt (e.g., my GPT prompts have multiple sections that apply to different situations, each with 7-15 steps, and GPT-4-Turbo can handle all of that when properly instructed, messing up only occasionally). That’s why I think GPT-4-Turbo is not necessarily much more stupid than GPT-4, but it has a lot of issues with attention that make it look way dumber if you prompt it as if it was GPT-4.

1 Like

I made new command-line commands that include children and grandchildren of skills, and added rules to enforce the skills and back up the grandchildren to enforce the command. It forces them to go through a specific set of skills to increase speed, productivity and accuracy. I also continue to implement learning skills into each tree high in the command line. Brain charts also help.

I bought a subscription. On the first day the responses and understanding were excellent, but afterwards the chat started sending me to websites to study the materials directly at the source. Moreover, I asked it to compose or describe tasks according to my headings and structure them for my study, but each time it sent me to different sources, and then boom, “you’ve reached the message limit”, because it keeps giving me the same kind of response and cannot provide descriptions for 27 headings. 3.5 handles this calmly, without sending me to a website to do the structuring myself.

I agree 100%.
I was mainly describing the symptoms. All my conclusions might be completely wrong.

Even the issue we are focusing on in this thread most likely has multiple dimensions: is the “laziness” a matter of response length or of response complexity? Does it happen at the input stage or at the response stage? The fact that you can often get better responses by prompting it to provide a rationale for its response suggests that the issue is in fact a multitude of issues.

I was just thinking the same, but you beat me to pointing that out. =) Describing these issues in terms of human behaviours might cause unintentional misinterpretations, especially since such terms often assume secondary traits. (“lazy … because…?”)

Yes, there should actually be an LLM/AI-specific terminology and taxonomy for the issues that we see.

I mean, in the end, the LLMs are neither stupid nor lazy. (The first would suggest very limited possibility of improvement, and the second some kind of incentive.)
While writing this, I am realizing that “lazy” might be more useful for correct but incomplete responses (e.g. “only first 50% of the code was returned, but at least it’s correct”).
While “stupid” could refer to “(execution) proficiency” (e.g. “how accurate is the response”).

But maybe proper terminologies exist, I don’t know.

You are referring to the recent update for GPT4 about incomplete code generation, right?

I see this as exactly that: it partially ignores instructions and goes off on a “rant”, something GPT3.5 does a lot (imo). There is a VERY obvious difference between GPT4 and 4T, where the latter will ignore prompts such as “Don’t explain, just give me the answer”.

Yes, it is a “safer” way. But where it’s problematic is when you have an un-predetermined response format. E.g., should the response be tabular data, lists, text? (Just something that I found tricky sometimes.)

Yes, this is definitely problematic. Potentially the LLM equivalent of the Dunning-Kruger effect.

I wonder about the consequences. It does seem to me that you end up with something highly efficient but highly one-dimensional.

I kind of remember that being a stated goal. I completely forgot about this, thanks for mentioning it.

1 Like