At least for me, it’s so obvious. Painfully. - I think it would be better to call it “GPT4-Light”.
– Examples at the end –
It has same issues that GPT3.5 has vs GPT3, where GPT3 was much, MUCH better att following instruction than GPT3.5. While GPT3.5 kept falling into the same paths, so to say. I found GPT3.5 quite useless. It felt more like Ada2 with more data.
I see the same happening with GPT4-Turbo. - I would say that additional data has been used to cover up it’s lower ability.
It’s noticeable in ChatGPT too. It either does’t understand OR ignores much of the conversation. It loses track of the context of the chat. I end up having to rewrite everything to one single prompt.
It’s like a person with more facts, but less ability to constructively use it.
This is me venting. Your opinion might differ.
My opinion is based on minimum 2-5kUSD/month. I’m not claiming I know everything.
I think OpenAI need to be more realistic and/or transparent about new models.
A few examples:
(I know it’s an LLM, but I will write from the perspective that is and an actual AI. I think it’s more useful for others then.)
- If generating text, specifying terminology in the prompt or instructing it about certain patterns is less likely to be followed throughout the entire response by GPT4-Turbo than regular GPT4. Basically, it strays off from instructions quite often.
It’s somewhat fixable with overly assertive and repetitive instructions. I.e. If you ask it to “Do THIS thing” but it doesn’t, it does help to repeat it, especiall adding it as a last part of the prompt. - GPT3.5 had exactly the same behaviour.
I would say that GPT4-Turbo is fine for less specific prompts, while GPT4 is better at following instructions more strict.
- Lightly ignoring instuctions. For any amount of complexity, GPT4-Turbo can often decide on one single thing to focus on in the prompt, if the prompt contains multiple requests.
A real example that I actually can reveal in detail: If I asked it to generate HTML + set specific HTML-tag attributes + set the tone/style/etc for the text data it should contain, it would often partially ignore some of this. If hte HTML became great, the text wouldn’t. With GPT4, I did not experience this problem.
What you end up doing is having to turn one single prompt to an itterative process. (Which is kind of ok, but keep in mind I am comparing to the GPT4, which is an older model.)
GPT3.5 had the same thing but worse, where basically it would have a hard time following the multiple nuances that the prompt contained. Forget reliably creating HTML and text.
- Translations. GPT4-Turbo is clearly aware of more terminology in non-English, but it often uses it wrong. You can “fix” this but asking it to output it’s path of reasoning. If you do this, it get’s most thigns almost always right.
So I ended with the Catch 22 of either more accurate usage on terminology, or less accurate but wider vocabulary. There is no clear winner between these two.
I think this might be a good indication of what is going it. It’s as if it does not understand itself or has a lesser scope, even intra-generation/completion.
I think it is a lesser scope issue. Or maybe less weight is put on the instructions. The larger context feels more like a vanity-metric.
- Coding
I made about 15-20 wordpress plugins with GPT4 (I can program, but I am not familiar WP-specific functions etc, so I understand precisely the code it generates.).
GPT4-Turbo is obiously making many more mistakes and “forgets” the context much more. If you ask it to edit existing code that it itself created, GT4-Turbo is much more likely to forget the instructions and choices made before.