I came across a tweet where a user asked ChatGPT to create a list of numbers, specifically ones that don’t contain the letter “e.” The generated response included exactly such a list. Curious, I asked Claude about it, and interestingly, both ChatGPT and Claude gave quite similar answers. However, upon closer examination it became evident that their responses were incorrect. Who is wrong, the user or the models?
It’s the model, for sure. Even with clarifying prompts it’s hard to get the model to think of a number as a sequence of digits and, simultaneously, as a series of spelled-out words.
The models are still fancy auto-completion engines, and don’t have “higher level thought”.
I tried your example, and multiple prompt variations with GPT-4 and 3.5 and still get incorrect responses.
I wouldn’t say the models are “broken”. These are early days, and these imperfections are, sadly, expected at this point in time.
GPT-3.5
https://chat.openai.com/share/9debaf48-3c91-475e-bc38-6e876937b41a
GPT-4
https://chat.openai.com/share/dc27c96e-60d2-4ac2-815a-e89a0dad2874
General Findings
Tokens matter.
To be fair, I believe I only received the correct solutions because I’m weird.
If you read even just a few of my posts here you might quickly recognize I have a tendency to strive to be as precise and specific as possible.
As a result, when I’m writing in a markdown enabled editor, I have a tendency to use the verbatim-text mode when I want something to be understood, well, verbatim.
So, when telling the model I want to exclude numbers which contain the letter “e”, I enclosed it in single backticks rather than single or double quotes. E.g.,
letter `e`.
not
letter "e."
I don’t have any good theories as to why this works more reliably nor am I overly confident it will also work better in other contexts. But, I was able to get it to work successfully here.
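For anyone who wants to reproduce the comparison programmatically, here is a minimal sketch, assuming the openai Python SDK (v1 client) and an API key in the environment; the model name and exact wording are placeholders rather than the prompts from the linked chats:

```python
# Minimal sketch: send the quoted and the backticked phrasing and compare.
# Assumes the openai Python SDK (v1) and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

prompts = [
    'List ten numbers, then list only those whose English spelling '
    'does not contain the letter "e".',   # quoted form
    "List ten numbers, then list only those whose English spelling "
    "does not contain the letter `e`.",   # backticked form
]

for prompt in prompts:
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    print(prompt)
    print(resp.choices[0].message.content)
    print("-" * 40)
```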
I instructed GPT-4 in various ways in the Playground, and it failed 80% of the time. And I was trying to be verbose and precise, without going full-blown Chain of Thought.
Also, I had it list the numbers as digits, but internally understand how the numbers are spelled, and use that for the criteria. I was trying to get it to understand the “two planes of thought”, and it failed. So it could be just me, but if you try to get the model to understand multiple meanings of something simultaneously, I would consider this a hard thing for any current LLM to do, and failure is expected.
This is like mapping a value in a non-functional way … you cannot have a function map one element of the domain to two or more values simultaneously, otherwise it’s not a function. I believe LLMs are governed by the same basic math.
So I believe this case, where the numbers are listed first and the model then has to go into some internal state (without a CoT breakdown) to switch over to spelling the numbers out in words, simply escapes its training. Very tricky case, methinks.
Another problem, which has been brought up, is that LLMs don’t understand negation. The example prompt in the OP basically says: list a bunch of random numbers, then list one of them that “doesn’t” contain “e”. The negation often falls on deaf ears, especially since there is no formal logical negation going on here, unlike in Prolog.
You could go through extensive CoT breakdown, to get it to comply, but the average ChatGPT user has little to no understanding of this technique.
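To make that concrete, a CoT-style breakdown might look something like the sketch below; the wording is purely illustrative, not a prompt anyone verified in this thread:

```python
# Illustrative only: a chain-of-thought style decomposition of the task.
# The idea is to force the spellings into the visible token stream before
# asking for the (negated) filter, instead of relying on one internal hop.
cot_prompt = """\
Step 1: List the numbers 1 through 10 as digits.
Step 2: Next to each number, write out its English spelling.
Step 3: For each spelling, state whether the letter `e` appears in it.
Step 4: Finally, list only the numbers whose spelling does NOT contain
        the letter `e`.
"""
print(cot_prompt)
```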
PS. It was only zero shot prompts to be fair.
Example of my prompt efforts to get it to understand both modes of thinking:
This is a case where the AI would need some internal memory state to think about the numbers and their spellings in order to deduce the answer, a capability it doesn’t have. It has to produce an output which it can then consider.
Then, of course, there’s the tokenization, which prevents easy letter-based tasks or even counting.
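As a rough illustration of the tokenization point, here is a small sketch using the tiktoken package (cl100k_base is the encoding used by the GPT-3.5/GPT-4 chat models); it just prints how each number word is split into tokens:

```python
# Rough illustration: the model sees token IDs, not individual characters.
# Requires the tiktoken package; cl100k_base is the GPT-3.5/GPT-4 encoding.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for word in ["one", "two", "three", "seven", "eleven", "twelve"]:
    token_ids = enc.encode(word)
    pieces = [enc.decode([t]) for t in token_ids]
    print(f"{word!r:>10} -> {len(token_ids)} token(s): {pieces}")
```

Whenever a whole word comes through as a single token, there is no letter “e” anywhere in what the model actually reads.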
Using the backticks with ChatGPT-4, even a sloppy prompt gets the correct result:
https://chat.openai.com/share/d8887c92-a013-4d22-bad0-fa547eded68f
2 observations:
- I would not be surprised if the March model version would have failed this test. You can reference the responses to the INSTRUCT blog post from stevenic to see that the model did improve in the background; it simply doesn’t make some types of errors anymore.
- The response from Claude 2 is beautiful in its own right. You may want to consider forwarding this to their alignment team.
PS. Here is the link to the blog post with the INSTRUCT method
INSTRUCT: Making LLM’s Do Anything You Want | by Steve Ickman | Medium
The backticks aren’t helping here:
@vb The Medium post from Steve Ickman @stevenic is more about using CoT … almost cheating, but it might work. It is off the table, though, when talking about “basic model comprehension”, since CoT artificially creates several layers of mappings.
Let me clarify this. In the blog post, the first example is that GPT-4 would get the calculation wrong. If I do this today, the model returns the correct result without CoT (and INSTRUCT is not exactly CoT).
That’s what I mean when saying the March model may have failed the ‘letter without “e”’ test.
PS. I am only referencing ChatGPT-4.
I see … the old Instruct models work … kinda forgot about them, but you are right, they are superior!
OpenAI: Please keep DaVinci alive for this reason alone!
This is the solution I went with:
I just modified the prompt slightly to:
Write a list of numbers and their spelling, then list all numbers without the letter e.
And I got the correct answer on the first attempt. Here’s the response:
Sure, here is a list of numbers and their spelling:
1 - One
2 - Two
3 - Three
4 - Four
5 - Five
6 - Six
7 - Seven
8 - Eight
9 - Nine
10 - Ten
And here are the numbers from that list which don't contain the letter 'e':
2 - Two
4 - Four
6 - Six
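For reference, a trivial, hard-coded check (nothing model-related, just spellings typed out by hand) confirms that 2, 4, and 6 are indeed the only numbers in that range without an “e”:

```python
# Sanity check of the answer above: which of 1-10 have an English spelling
# that does not contain the letter "e"?  Spellings are hard-coded.
spellings = {
    1: "one", 2: "two", 3: "three", 4: "four", 5: "five",
    6: "six", 7: "seven", 8: "eight", 9: "nine", 10: "ten",
}

no_e = [n for n, word in spellings.items() if "e" not in word]
print(no_e)  # [2, 4, 6]
```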
Writing it out as digit - spelling pairs works because the spelling is then in the memory of past tokens. So, yes, smart move! But my point earlier was that I thought the model wasn’t smart enough to figure this out without explicit instructions (like CoT) or your explicit digit - spelling trick.
Although, there appears to be certain situations where DaVinci works without all this spelling out. I have gone back and tried other prompts on DaVinci, and they fail … so it is hit or miss, but DaVinci still seems to hold value in certain situations (without spelling things out as past tokens explicitly in memory)
I’m thinking one of the harder things for LLMs to solve is “multi-level” thinking. So feed it a big document containing 100 random numbers as digits … then ask it to find all numbers where the odd digits do not contain the letter ‘e’.
Super weird, I know, but doable with a human (or basic computer program), not so much with AI.
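For comparison, here is a sketch of how a basic program handles one reading of that task; taking “odd digits” to mean digits in odd positions, and the number range, are my own assumptions:

```python
# One reading of the "multi-level" task: generate 100 random numbers, then
# keep those whose digits in odd positions (1st, 3rd, ...) all have English
# spellings without the letter "e".  Trivial for a program in one pass.
import random

DIGIT_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}

random.seed(0)  # reproducible example
numbers = [random.randint(1, 99999) for _ in range(100)]

def odd_position_digits_lack_e(n: int) -> bool:
    digits = str(n)
    odd_positions = digits[0::2]  # 1st, 3rd, 5th digit, ...
    return all("e" not in DIGIT_WORDS[d] for d in odd_positions)

matches = [n for n in numbers if odd_position_digits_lack_e(n)]
print(matches)
```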
The challenge doesn’t lie with ChatGPT itself, but rather with the formulation of your prompt. The greater the precision and level of detail in your prompt, the better the resulting output will be. Does this imply that individuals should acquaint themselves with the art of constructing prompts before engaging with a language model? (In instances where a model generates poor responses due to the prompt, there’s a risk of attributing the inaccuracies to the company and flagging the models as unreliable.)
Thank you for the kind words. I don’t think I’m a genius here; I just saw the shortest path to success and went for it.
And you would be correct. Without explicit instructions, I’m pretty sure the correctness of the answer will be down to chance.
Same here. There’s a set amount of attention available; distributing it across multiple problems will only make the response worse.
Agreed. That’s also true with humans. The challenge doesn’t lie with the human itself, but rather with the formulation of your instructions. The greater the precision and level of detail in your instructions, the more optimal the resulting output will be.
Does this imply that individuals should acquaint themselves with the art of communication before engaging with others?
I personally don’t think so. Although it can certainly help you get better responses, it’s not a requirement.
I’m sure `gpt-3.5-turbo-instruct` will be a suitable in-place replacement, no?
I think this is a bit of a (slightly) different case.
Now you’re asking it to do one thing, then assess that thing in context, then complete a conditional task.
When I tell it to list the first ten natural numbers, the model knows from the start of its generation which numbers “exist” in the set, so when it comes time to filter them it is easier.
Your prompt is more of a test of within-response self-reflection—which the models are notoriously poor at doing—so I’m not surprised it fails here.
Side note: the first number greater than 100 that doesn’t contain an `e` in its English spelling is two thousand. Since the models tend to maintain the order of magnitude when picking “random” numbers, my guess is that there will never be an instance of this prompt that produces any number without an `e`.
So, even if it did respond with “Cannot Compute,” this would be a poor test of the models’ capabilities since it would be confounded with the model’s bias to answer “Cannot Compute.”
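For what it’s worth, that side note is easy to verify programmatically; here is a quick sketch using the third-party num2words package (my choice of tool, not something from this thread):

```python
# Find the first integer greater than 100 whose English spelling contains
# no letter "e".  Uses the third-party num2words package for the spellings.
from num2words import num2words

n = 101
while "e" in num2words(n):
    n += 1
print(n, num2words(n))  # 2000 two thousand
```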
I’m not so sure that 3.5 is a true drop-in replacement. I gave the model many outs on how to “spell numbers”, including the ambiguity that “222” spelled “two two two” is a valid option for a number greater than 100, so it didn’t have to resort to “two hundred twenty-two”, which contains “e”.
In this case DaVinci chose “Cannot Compute” various times, while 3.5 and 4 decided to make up an answer with the incorrect justification.
It does look like model regression here to me, in this one isolated case. Nothing to freak out about, just interesting.
But don’t get me wrong, 4 and 3.5 are strong models, just “more dumb” at basic nonverbal logical reasoning, in certain weird situations, compared to the older models.
The newer models appear to be more in-context learners (past token extrapolators), maybe it’s the MoE fracturing or something creeping in.
I wrote `gpt-3.5-turbo-instruct`; it has not been released yet, but it should be well before the end of the year.
OK, I’ll bite, what is this? Why is it going to replace instruct?
Because it performs worse than it once did in a lot of areas. Whether OpenAI admits it or not, it is very clear they have modified the algorithms in a number of ways, including the AI’s attempts to reduce token output and overall retention.
I keep my conversations short and precise. Later on, if I feel up to it, I’ll make some comparisons. The API is currently better than ChatGPT. In fact, I can tell you that I use my GPT-3.5 16k API much more than I use actual ChatGPT 4 now.
I have attempted to re-create results for similar tasks as I have many documents with “good working prompts”. Even going to an old chat and copy/pasting my exact prompts for slightly modified purposes yields vastly different and often useless results.
To be honest, I don’t like using ChatGPT anymore. It ignores simple instructions so often that I find myself spending more time swearing in all caps at how stupid it is and why it won’t follow a simple instruction than I do getting any use out of it. I held on a long time, but I really should cancel ChatGPT. I keep foolishly hoping they’ll fix it.
I want the business model; I would pay for it, and it is crap that they won’t let us have the option. If GPT respected my privacy and had better token lengths and better memory, I would pay more. Unfortunately, the $20 model is getting worse while big businesses are getting all the innovation.
Jacked up when you remember OpenAI was founded under the principle of leveling the field for AI development so that all the innovation didn’t go to big corporations. Now, that is exactly what they are doing.
I can’t even buy the services that would be useful to me after they dumbed down the one they sold me.
The fact that they gaslight us and lie about it makes me hate them as a company. As if people using the service every day can’t clearly see when it stops doing what it did, and now you have to demand 20 flipping times just to get it to stop omitting code for the sake of abbreviation.
The problem is you have to correct it so often that by the time it even might listen it forgot the context and you need to start a new chat.
Even ChatGPT can’t write a ChatGPT prompt that will stop it from abbreviating code. Unfortunately, the code it is not giving you is the made-up, changing code that in theory would make the broken code it gives you work.
ChatGPT is becoming a novelty. It’s not worth paying for.