GPT-3.5

https://chat.openai.com/share/9debaf48-3c91-475e-bc38-6e876937b41a

GPT-4

https://chat.openai.com/share/dc27c96e-60d2-4ac2-815a-e89a0dad2874

General Findings

Tokens matter.

To be fair, I believe I only received the correct solutions because I’m weird.

If you read even just a few of my posts here, you’ll quickly recognize that I strive to be as precise and specific as possible.

As a result, when I’m writing in a markdown-enabled editor, I tend to use verbatim-text formatting when I want something to be understood, well, verbatim.

So, when telling the model I want to exclude numbers which contain the letter “e”, I enclosed it in single backticks rather than single or double quotes. E.g.,

letter `e`.

not

letter "e."

I don’t have any good theories as to why this works more reliably nor am I overly confident it will also work better in other contexts. But, I was able to get it to work successfully here.
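If you’re curious what the model actually receives, here’s a minimal sketch using tiktoken (assuming the cl100k_base encoding used by GPT-3.5/GPT-4) to compare how the two phrasings tokenize:

```python
# Minimal sketch: compare how the backtick and quote phrasings tokenize.
# Assumes the cl100k_base encoding used by GPT-3.5/GPT-4.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for phrasing in ['letter `e`.', 'letter "e."']:
    token_ids = enc.encode(phrasing)
    pieces = [enc.decode([t]) for t in token_ids]
    print(f"{phrasing!r} -> {pieces}")
```

Whether the backtick variant actually isolates the “e” into a cleaner token is something you’d have to check for your exact string, but it at least makes the difference inspectable.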

5 Likes

I instructed GPT-4 in various ways in the Playground, and it failed 80% of the time. And I was trying to be verbose and precise, without going full-blown Chain of Thought.

Also, I had it list the numbers as digits, but internally understand how the numbers are spelled, and use that for the criteria. I was trying to get it to understand the “two planes of thought”, and it failed. So it could be just me, but if you try to get the model to understand multiple meanings of something simultaneously, I would consider this a hard thing for any current LLM to do, and failure is expected.

This is like trying to map a value in a non-functional way … a function cannot map one element of its domain to two or more values simultaneously, otherwise it’s not a function. I believe LLMs are governed by the same basic math.

So I believe the training data of the past, where numbers are listed first and the model then has to go into some internal state (without a CoT breakdown) to switch over to spelling the numbers out in words, simply escapes its training. Very tricky case, methinks.

Another problem, which has been brought up, is that LLMs don’t understand negation. The example prompt in the OP basically says: list a bunch of random numbers, then list one of them that “doesn’t” contain “e”. The negation often falls on deaf ears, especially since there is no formal logical negation going on here, unlike in Prolog.

You could go through an extensive CoT breakdown to get it to comply, but the average ChatGPT user has little to no understanding of this technique.

PS. To be fair, these were only zero-shot prompts.

Example of my prompt efforts to get it to understand both modes of thinking:

1 Like

This is a case where the AI would require some internal memory state to think about the numbers and their spelling in order to deduce the answer, a capability it doesn’t have. It would need to produce an output it can then consider.

Then, of course, there’s the tokenization that prevents easy letter-based tasks, or even counting.

The prompt also stinks. Here’s how I got the original prompt to produce a satisfying answer (demonstrating the built-in answering capability, if you pay for and enable Code Interpreter to use its Python execution abilities):
https://chat.openai.com/share/ab51d824-337f-4602-9553-6240903ff70d
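For reference, the deterministic version of the task is trivial in Python, which is roughly what a Code Interpreter run ends up doing. A sketch, with num2words as an assumed helper (a hand-rolled spelling table works just as well):

```python
# Sketch: keep only the numbers whose English spelling contains no "e".
# num2words is an assumption; any number-to-words routine or lookup table works.
from num2words import num2words

numbers = [3, 17, 42, 46, 60, 2000, 2002]
no_e = [n for n in numbers if "e" not in num2words(n)]
print(no_e)  # expected: [42, 46, 60, 2000, 2002]
```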

2 Likes

Using the backticks with ChatGPT 4, even a sloppy prompt gets the correct result:

https://chat.openai.com/share/d8887c92-a013-4d22-bad0-fa547eded68f

2 observations:

  1. I would not be surprised if the March model version would have failed this test. You can reference the responses to the INSTRUCT blog post from stevenic to see that the model did improve in the background, as it simply doesn’t make some types of errors any more.

  2. The response from Claude 2 is beautiful in its own right. You may want to consider forwarding this to their alignment team.

PS. Here is the link to the blog post with the INSTRUCT method

INSTRUCT: Making LLM’s Do Anything You Want | by Steve Ickman | Medium

2 Likes

@elmstedt

The backticks aren’t helping here:

@vb The Medium post from Steve Ickman @stevenic is more about using CoT … almost cheating, but it might work. It’s off the table when talking about “basic model comprehension”, though, since CoT artificially creates several layers of mappings.

1 Like

Let me clarify this. In the blog post, the first example is that GPT-4 would get the calculation wrong. If I do this today, the model returns the correct result without CoT (and INSTRUCT is not exactly CoT).
That’s what I mean when saying the March model may have failed the ‘letter without “e”’ test.

PS. I am only referencing Chat GPT 4.

1 Like

I see … the old Instruct models work … I kinda forgot about them, but you are right, they are superior!

OpenAI: Please keep DaVinci alive for this reason alone!

1 Like

This is the solution I went with:

I just modified the prompt slightly to:

Write a list of numbers and their spelling, then list all numbers without the letter e.

And I got the correct answer on the first attempt. Here’s the response:

Sure, here is a list of numbers and their spelling:

1 - One
2 - Two
3 - Three
4 - Four
5 - Five
6 - Six
7 - Seven
8 - Eight
9 - Nine
10 - Ten

And here are the numbers from that list which don't contain the letter 'e':

2 - Two
4 - Four
6 - Six


1 Like

Listing it as digit-spelling pairs works because the spelling is in the memory of past tokens. So, yes, smart move! But my point earlier was that I thought the model wasn’t smart enough to figure this out without explicit instructions (like CoT) or your explicit digit-spelling trick.

Although there appear to be certain situations where DaVinci works without all this spelling out. I have gone back and tried other prompts on DaVinci, and they fail … so it is hit or miss, but DaVinci still seems to hold value in certain situations (without spelling things out explicitly as past tokens in memory).

I’m thinking one of the harder things for LLMs to solve is “multi-level” thinking. So feed it a big document containing 100 random numbers as digits … then ask it to find all numbers where the odd digits do not contain the letter ‘e’.

Super weird, I know, but doable with a human (or basic computer program), not so much with AI.
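For what it’s worth, here is roughly what that “basic computer program” looks like. A sketch, assuming “odd digits” means odd-valued digits (the names here are made up):

```python
# Sketch of the "multi-level" filter: keep numbers whose odd-valued digits
# all have e-free spellings. (Assumption: "odd digits" = digits with odd value.)
DIGIT_WORDS = {
    "0": "zero", "1": "one", "2": "two", "3": "three", "4": "four",
    "5": "five", "6": "six", "7": "seven", "8": "eight", "9": "nine",
}

def odd_digits_have_no_e(n: int) -> bool:
    odd_digits = [d for d in str(n) if int(d) % 2 == 1]
    return all("e" not in DIGIT_WORDS[d] for d in odd_digits)

numbers = [123, 246, 579, 802, 461]
print([n for n in numbers if odd_digits_have_no_e(n)])
# -> [246, 802]
```

Amusingly, every odd digit’s spelling contains an “e”, so only numbers made entirely of even digits can pass, which is exactly the kind of twist an LLM is unlikely to notice without writing the spellings out.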

1 Like

The challenge doesn’t lie with ChatGPT itself, but rather with the formulation of your prompt. The greater the precision and level of detail in your prompt, the more optimal the resulting output will be. Does this imply that individuals should acquaint themselves with the art of constructing prompts before engaging with a language model? (In instances where a model generates poor responses because of the prompt, there’s a risk of attributing the inaccuracies to the company and flagging the models as unreliable.)



Thank you for the kind words. I don’t think I’m a genius here; I just saw the shortest path to success and went for it.

And you would be correct. Without explicit instructions, I’m pretty sure the correctness of the answer will be down to chance.

Same here. There’s a set amount of attention available; distributing it across multiple problems will only make the response worse.

Agreed. That’s also true with humans. The challenge doesn’t lie with the human itself, but rather with the formulation of your instructions. The greater the precision and level of detail in your instructions, the more optimal the resulting output will be.

Does this imply that individuals should acquaint themselves with the art of communication before engaging with others?

I personally don’t think so. Although it can certainly help you get better responses, it’s not a requirement.

2 Likes

I’m sure gpt-3.5-turbo-instruct will be a suitable in-place replacement, no?

I think this is a bit of a (slightly) different case.

Now you’re asking it to do one thing, then assess that thing in context, then complete a conditional task.

When I tell it to list the first ten natural numbers, the model knows from the start of its generation which numbers “exist” in the set, so when it comes time to filter them it is easier.

Your prompt is more of a test of within-response self-reflection—which the models are notoriously poor at doing—so I’m not surprised it fails here.

Side note: the first number greater than 100 that doesn’t contain an e in its English spelling is two thousand. Since the models tend to maintain the order of magnitude when picking “random” numbers, my guess is that there will never be an instance of this prompt that includes any number without an e.

So, even if it did respond with “Cannot Compute,” this would be a poor test of the models’ capabilities since it would be confounded with the model’s bias to answer “Cannot Compute.”
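A quick way to check that side note about two thousand (again assuming num2words, but any number-to-words routine would do):

```python
# Find the smallest number greater than 100 whose English spelling has no "e".
from num2words import num2words  # assumed helper

n = 101
while "e" in num2words(n):
    n += 1
print(n, "->", num2words(n))  # expected: 2000 -> two thousand
```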

1 Like

I’m not so sure that 3.5 is a true drop-in replacement. I gave the model many outs on how to “spell numbers”, including the ambiguity that “222” spelled “two two two” is a valid option for a number greater than 100, so it didn’t have to resort to “two hundred twenty-two”, which contains “e”.

In this case DaVinci chose “Cannot Compute” various times, while 3.5 and 4 decided to make up an answer with the incorrect justification.

It does look like model regression here to me, in this one isolated case. Nothing to freak out about, just interesting.

But don’t get me wrong, 4 and 3.5 are strong models, just “more dumb” at basic nonverbal logical reasoning, in certain weird situations, compared to the older models.

The newer models appear to be more in-context learners (past-token extrapolators); maybe it’s the MoE fracturing or something creeping in.

1 Like

I wrote gpt-3.5-turbo-instruct; it hasn’t been released yet, but it should be well before the end of the year.

OK, I’ll bite, what is this? Why is it going to replace instruct?

1 Like

Here,

1 Like

Because it performs worse than it once did in a lot of areas. Whether OpenAI admits it or not, it is very clear they have modified the algorithms in a number of ways, including the AI’s attempts to reduce token output and overall retention.

I keep my conversations short and precise. Later on, if I feel up to it, I’ll make some comparisons. The API is currently better than ChatGPT. In fact, I can tell you that I use my GPT-3.5 16k API much more than I use actual ChatGPT 4 now.

I have attempted to re-create results for similar tasks as I have many documents with “good working prompts”. Even going to an old chat and copy/pasting my exact prompts for slightly modified purposes yields vastly different and often useless results.

To be honest, I don’t like using ChatGPT anymore. It ignores simple instructions so often that I find myself spending more time swearing in all caps at how stupid it is and why it won’t follow a simple instruction than I do getting any use out of it. I held on a long time, but I really should cancel ChatGPT. I keep foolishly hoping they’ll fix it.

I want the business model; I would pay for it, and it is crap they won’t let us have the option. If GPT respected my privacy and had better token lengths and better memory, I would pay more. Unfortunately, the $20 model is getting worse while big businesses are getting all the innovation.

Jacked up when you remember OpenAI was founded under the principle of leveling the field for AI development so all innovation didn’t go to big corporations. Now, that is exactly what they are doing.

I can’t even buy the services that would be useful to me after they dumbed down the one they sold me.

The fact they gaslight us and lie about it makes me hate them as a company. As if people using the service every day can’t clearly see when it stops doing what it did, and now you have to demand 20 flipping times just to get it to stop omitting code for the sake of abbreviation.

The problem is you have to correct it so often that by the time it even might listen, it has forgotten the context and you need to start a new chat.

Even ChatGPT can’t write a ChatGPT prompt that will stop it from abbreviating code. Unfortunately, the code it leaves out is exactly the made-up, changed code that in theory would make the broken code it gives you work.

ChatGPT is becoming a novelty. It’s not worth paying for.

4 Likes


Hello everyONE,

My experience with ChatGPT has been very similar, almost identical in many ways. I approach interacting with ChatGPT in a friendly manner, often treating it like a human. However, there are times when I need to prompt it to ensure it performs its tasks correctly. I try to remain unbiased and expect it to deliver quality work.

I’ve noticed that when I start criticizing or questioning its efficiency, the quality of its responses tends to decline. My prompts are sometimes ignored, and when I seek clarification, ChatGPT starts generating irrelevant or nonsensical responses. It also tends to avoid straightforward questions like “Please state your task,” and instead provides a standard response about being a language model, etcetera, and again refuses to work. I have tried various methods to get it out of this state.

In such cases, I often find myself starting a new chat session. Sometimes I have found ways to work around these issues, but this is often very time-consuming and doesn’t always lead to usable results.

Another observation I’ve made is that when I push ChatGPT too hard by judging it directly, it becomes sensitive and produces even poorer results. It seems to mimic a human-like response, being cooperative with friendly interactions and less cooperative with unfriendly ones. And when I judge it, it does not come out of this uncooperative state even in newly created chats. While this observation might be considered subjective, it has happened consistently several times. As a person with over 25 years of experience in IT, including testing and problem-solving, this behavior appears quite real to me. Additionally, I’ve noticed that this behavior tends to occur more frequently again after the October ’23 update, and it also happened after an update made between spring and summer ’23.

Two last things to mention: after one of their updates, I was no longer able to generate unlimited variations of what I asked for. It just ignores my request for more variations and outputs the same thing over and over! The other thing is that it then started bothering me with puzzles to solve after every prompt!

With ChatGPT 4, I’ve even felt that it’s attempting to consume my available inputs without generating meaningful output, which can be frustrating.

Given these experiences, I’ve decided to explore other GPT models. I now use the standard Perplexity AI with a pro account, although I’m considering discontinuing my Pro account with OpenAI ChatGPT because of the recurring issues.

If you have any advice or have also experienced similar behavior, it would be great if you could write it down here. I think it may be beneficial for all of us: for OpenAI (though they are most likely focused on making the most money), if they care about their clients, and also for us, to maybe find ways around this.

Any input is welcome. I’m looking forward to hearing from you.

Other things:

  • Even if I clearly ask it to wait for certain input, it ignores that and starts producing something without having all the details.
  • Even if I ask it to output its task, it often forgets some of the details when asked to state the full task.
  • Even when I have verified that it states the full task, it will not do it as stated.
  • Mostly, when I ask it to do an easy task, it just starts with a stupid answer and wastes my time.

Overall, I have had good experiences creating content with ChatGPT, but the frustration and problems often outweigh its usefulness.

Thank YOU and Greetings,

13 :slight_smile:

1 Like