GPT-3.5-turbo-1106 is worse than the 0613 version

After a period of use, I find that the new gpt-3.5-turbo-1106 version is not better than the 0613 version. It cannot answer some prompts, or replies with short or inaccurate responses, while the old version answers accurately with longer responses.

I don't know why the new version is worse than the old one.

3 Likes

This is highly subjective. For my use cases it works perfectly; for yours it may not (and there is a chance you just need to update your prompts, as a different model generally has a different “personality” with a slightly or significantly different reasoning style).

2 Likes

For me, personally, I have the impression that every time they fine-tune the model we lose some inherent intelligence.

But that could just be coincidental; I don’t have any evals to support it. I just have a very strong intuition, from having made close to 10,000 messages with ChatGPT over the last year, that aggressively fine-tuning the model like this just to keep it up to date with new information, rather than for behavioural adaptations such as function calling, might be detrimental to the model.

1 Like

That might well be a cognitive distortion.

Yes, that’s what I mean by coincidental. Every output is different, so this could just be making me think there’s a correlation when it may really be nothing more than a bad streak at a casino.

But I have been very active over a long time, so I don’t know.

1 Like

Not that I reject the idea (and I hear it a lot lately), but my personal experience tells me the opposite, so as a nerd, I’d love to see actual research before forming an opinion :slight_smile:

I can understand the “shorter response” part. I tried to get 1106 to produce some long speeches by instructing it to “write up to 500 tokens” or more. It just doesn’t work the way it does with 0613.

If enough of us keep using the 0613 version after they make 1106 the default, they may consider prolonging the lifetime of 0613, since they already extended 0301 once.

I’ll be happy if I can keep using 0613 for another year. But six months is still a long time in the IT sector; something new is bound to come up.

They keep it up to date with new denials of stories your grandmother used to tell you about making napalm, and thousands of other denials, and a whole bunch of other RLHF that produces simpleton answers compared to what the AI actually knows and could say about cosmology and particle physics.

There needs to be a “remediation team”. Send all the training off to knowledge workers again, starting with the training data manually placed by OpenAI. If an example has the slightest hint of being a low-quality answer, or denies, warns, or otherwise refuses to fulfill a user’s request, delete it.

It might be impossible to do for the 800 GB of knowledge that was fed into the model’s training…

Yes, that’s also part of it. Alignment and this latest trend of injecting new information both add up to some very aggressive fine-tuning, and it’s very clearly a different process from the fundamental training, one with trade-offs that might not be getting caught by evals.

https://chat.openai.com/share/92ee32ef-ce7a-4afc-ab0c-cc181d705085
Here’s one example; see how difficult it is for me to “dig up” the latest programming best practices, with its fine-tuning fighting against itself.

Could you summarise what’s wrong there? For somebody who knows nothing about Flutter? :slight_smile:

The last piece of code is correct:

class MyCustomWidget extends StatelessWidget {
  // Dart 2.17 super-initializer parameter: forwards `key` to StatelessWidget.
  MyCustomWidget({super.key});
...

This update (super-initializer parameters, introduced in Dart 2.17) was released on May 11, 2022.

But in order to actually access that data, because it is not part of the original training but was instead applied through fine-tuning, I have to jump through many hoops, as you can see.

First it gives me the old way (part of its original training data). Just mentioning the feature was not enough; I had to ask it to list all the new features from that release, then ask about the feature it listed, and only then was it able to answer my original question: what's the new recommended version to pass the "key" of a widget to parent in the constructor for dart/flutter?
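
For anyone who doesn't know Flutter, here is a minimal sketch of the two styles in question (MyCustomWidget comes from the snippet above; the old-style class name and the trivial build methods are purely illustrative):

import 'package:flutter/widgets.dart';

// Old style (before Dart 2.17): declare a nullable Key parameter
// and forward it to the superclass constructor by hand.
class MyOldStyleWidget extends StatelessWidget {
  MyOldStyleWidget({Key? key}) : super(key: key);

  @override
  Widget build(BuildContext context) => const SizedBox.shrink();
}

// New style (Dart 2.17+): a super-initializer parameter forwards `key` automatically.
class MyCustomWidget extends StatelessWidget {
  MyCustomWidget({super.key});

  @override
  Widget build(BuildContext context) => const SizedBox.shrink();
}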

At least, to me, that tells us something about how “well” the model performs with these updates, and I feel this may or may not create inconsistent performance depending on the use case.

1 Like

That’s a really good point! I will bear this in mind when I encounter similar cases to test how the model reacts!

1 Like

In my use case, 1106 is worse than 0613. I’m using it for Chinese content extraction, and it performs worse than 0613, and the latest one is even worse than 1106. I tested it on 100+ articles and compared the results with GPT-4; the 0613 version is the best, and the new one is almost unusable.

1 Like

0613 is ranked 25th on the LMSYS leaderboard, whereas 1106 is ranked 46th. Even 0125 and 0314 are ranked higher.

There’s a sample size of about 50,000+ votes. At that size, it’s hardly subjective. LMSYS works on an Elo system, where users pick the better LLM response out of two LLM responses to their queries. The results can be interpreted as 0613 winning more often against other models than 1106 does, implying that 0613 is superior.

Also, 1106 ranks lower than models that I can literally run on my laptop, whereas this is not the case for 0613.
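
To make the Elo idea concrete, here is a minimal sketch of how a single head-to-head vote moves two ratings under a classic Elo update; the starting ratings and the K-factor of 32 are illustrative, and Chatbot Arena's actual methodology is more involved than this.

import 'dart:math' as math;

// Expected score of A against B under the standard Elo formula.
double expectedScore(double ratingA, double ratingB) =>
    1 / (1 + math.pow(10, (ratingB - ratingA) / 400));

void main() {
  var r0613 = 1000.0, r1106 = 1000.0; // hypothetical starting ratings
  const k = 32.0;                     // illustrative K-factor

  // One vote where the user prefers the 0613 response over the 1106 response.
  final e = expectedScore(r0613, r1106); // 0.5 while the ratings are equal
  r0613 += k * (1 - e);                  // winner gains rating
  r1106 -= k * (1 - e);                  // loser drops by the same amount

  print('0613: $r0613, 1106: $r1106');   // 0613: 1016.0, 1106: 984.0
}

Thousands of such pairwise votes are what produce the ranking differences quoted above.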

What that doesn’t show is the trend of evaluations.

gpt-3.5-turbo-0613 started off good.

It got progressively worse through major, overnight, undocumented changes foisted on the model.

And guess what I just experienced in ChatGPT with GPT-4: the same behavior seen in those later alterations to gpt-3.5, beyond the inability to follow system instructions. It got hung up on prior inputs, was unable to complete a new task and instead parroted back a prior code job, tried to answer my new question in the frame of the existing context, and then spat out the same code again, despite clear instructions that the old code was no longer in question. The session had to be abandoned.