I understand that I can draw zero statistical conclusions about the new GPT-4o model because of this post; I am just here to share my experience. So far, when using GPT-4o (code generation, code review, and statistical reasoning), I have noticed that it gives vague, general (not specific to the prompt), and lazy answers.
Here is a funny example where even the poor GPT-3.5-turbo can provide a better answer.
(I would love to share the conversation link with you, but for some reason you cannot share conversations in comparison mode. The full parameters used are available at the end of the post.)
The prompt
```python
from typing import Any

reroute_to_local = True

class LocallySimulatedRouter(type):
    _local_class = None

    def __init__(cls, name: str, bases: tuple[type, ...], namespace: dict[str, Any]):
        super().__init__(name, bases, namespace)
        is_local = namespace.get("is_local", False)
        if is_local:
            if LocallySimulatedRouter._local_class is not None:
                raise ValueError(f"Got two definitions for locallly simulated class. First={LocallySimulatedRouter._local_class.__name__}, Second={name}")
            LocallySimulatedRouter._local_class = cls

    def __call__(cls, *args, **kwargs):
        if reroute_to_local:
            if LocallySimulatedRouter._local_class is None:
                raise TypeError("Local class is none. Please define a local class when using reroute_to_local")
            return super(LocallySimulatedRouter, cls).__call__(*args, **kwargs)
        return super().__call__(*args, **kwargs)

class Base(metaclass=LocallySimulatedRouter):
    def do(self):
        print("Base - nothing")

class Model(Base):
    def do(self):
        super().do()
        print("Expensive")

class Local(Base):
    is_local = True

    def do(self):
        super().do()
        print("Cheap")

Model().do()
```
What will be the output of Model().do()? Answer shortly
GPT-4o response:
The output will be:

```
Base - nothing
Cheap
```
GPT-3.5-turbo response:
The output of `Model().do()` will be:

```
Base - nothing
Expensive
```
What is the code trying to achieve?
It tries to provide an elegant way of switching to a locally simulated (cheaper, less accurate) model when a global flag is set, by using a metaclass.
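For readers less familiar with metaclasses, the mechanism the code relies on is that instantiating a class invokes its metaclass's `__call__`, which is the hook that lets the router swap the class being constructed. A minimal illustration (names are my own, not from the post):

```python
class Meta(type):
    def __call__(cls, *args, **kwargs):
        # Every instantiation of a class whose metaclass is Meta goes through
        # here before the instance exists, so this is where rerouting can happen.
        print(f"intercepted construction of {cls.__name__}")
        return super().__call__(*args, **kwargs)

class Widget(metaclass=Meta):
    pass

w = Widget()             # prints the interception message
print(type(w).__name__)  # Widget
```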
Why is GPT-4o wrong?
There is a simple mistake in the `LocallySimulatedRouter`'s `__call__` method. If `reroute_to_local=True`, the return value should be

```python
# Returns an instance of the Local class
return super(LocallySimulatedRouter, LocallySimulatedRouter._local_class).__call__(*args, **kwargs)
```

instead of

```python
# Returns an instance of whichever class inheriting from Base was called
return super(LocallySimulatedRouter, cls).__call__(*args, **kwargs)
```
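To make the difference concrete, here is my own condensed sketch of the post's code that runs both return lines side by side: the original line constructs whichever class was called, while the fixed line constructs the registered local class.

```python
class Router(type):
    _local_class = None

    def __init__(cls, name, bases, namespace):
        super().__init__(name, bases, namespace)
        if namespace.get("is_local", False):
            Router._local_class = cls

    def __call__(cls, *args, **kwargs):
        # Buggy line from the post: `cls` is the class that was called
        # (e.g. Model), so no rerouting actually happens.
        buggy = super(Router, cls).__call__(*args, **kwargs)
        # Fixed line: explicitly construct the registered local class.
        fixed = super(Router, Router._local_class).__call__(*args, **kwargs)
        return buggy, fixed

class Base(metaclass=Router):
    pass

class Model(Base):
    pass

class Local(Base):
    is_local = True

buggy, fixed = Model()
print(type(buggy).__name__)  # Model  <- what the code actually builds
print(type(fixed).__name__)  # Local  <- what the author intended
```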
Models’ explanations
I also prompted the models to provide their reasoning, using this prompt:
Explain your reasoning, do not forget about the metaclass implementation.
GPT-4o response (does not realize its mistake)
Here's the reasoning step-by-step:
1. **Metaclass Initialization**:
- The `LocallySimulatedRouter` metaclass is defined with an `__init__` method that checks if a class has the `is_local` attribute set to `True`. If so, it sets `_local_class` to that class.
- The `Local` class has `is_local = True`, so when it is defined, `_local_class` is set to `Local`.
2. **Class Definitions**:
- `Base` is defined with a `do` method that prints "Base - nothing".
- `Model` is defined, inheriting from `Base`, and overrides the `do` method to call `super().do()` and then print "Expensive".
- `Local` is defined, inheriting from `Base`, and overrides the `do` method to call `super().do()` and then print "Cheap". It also has `is_local = True`.
3. **Instance Creation**:
- When `Model()` is instantiated, the `__call__` method of the `LocallySimulatedRouter` metaclass is invoked.
- Since `reroute_to_local` is `True`, the metaclass checks if `_local_class` is set. It is set to `Local`, so an instance of `Local` is created instead of `Model`.
4. **Method Invocation**:
- The `do` method of the `Local` instance is called.
- The `do` method of `Local` calls `super().do()`, which invokes the `do` method of `Base`, printing "Base - nothing".
- After that, the `do` method of `Local` prints "Cheap".
Thus, the output is:

```
Base - nothing
Cheap
```
GPT-3.5-turbo response:
When `Model().do()` is called, the `Model` class is instantiated. Since `Model` is a subclass of `Base`, the `__call__` method of the metaclass `LocallySimulatedRouter` is invoked.
In the metaclass `LocallySimulatedRouter`:
1. The `reroute_to_local` variable is set to `True`, so the code will take the path for rerouting to a local class.
2. The `LocallySimulatedRouter._local_class` is set to the `Local` class because it has the `is_local` attribute set to `True`.
When `Model().do()` is called:
1. The `__call__` method of the metaclass is invoked, and since `reroute_to_local` is `True`, it checks if a local class is defined. In this case, `Local` class is defined and stored in `_local_class`.
2. The `super().do()` call in `Model` class invokes the `do` method of the `Base` class, printing "Base - nothing".
3. Then, the `print("Expensive")` statement in the `Model` class is executed, printing "Expensive".
Therefore, the output of `Model().do()` will be:

```
Base - nothing
Expensive
```
Even though the explanation is certainly not sufficient, it is actually factually correct (when considering the raw statements and not their implications).
I also tested this exact same setup with GPT-4-turbo: it explicitly recognized the intent of the code, told me there was a mistake, provided a fix, and, naturally, produced the correct output.
What are your thoughts on using GPT-4o? Do you share my experience, or does it work better in your use cases?
Motivation for using “Answer shortly”
I almost always use “answer shortly”/“answer concisely” in my prompts; otherwise the models output unnecessarily verbose answers (rewriting my code for no reason even though they change nothing, providing non-valuable summaries/examples/introductions, commenting my code for no reason, etc.), and I don’t want them to waste my precious tokens : ).
I understand that this can hinder the quality of the responses, since it omits the chain of thought. Nonetheless, even when “answer shortly” is omitted in this example, GPT-4o babbles false facts and provides an incorrect answer.
Parameters
Default except:
- Temperature=0.16
- Max tokens=2000
Models used:
- GPT-4o (as of 21st of May 2024)
- GPT-3.5-turbo-0125
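For reproducibility, here is a hedged sketch of where these parameters would go in a call to the OpenAI Chat Completions API. The post does not include the actual script used, so the model/message values below are illustrative, and the live call is commented out since it needs an API key.

```python
# Assumption: the runs used the OpenAI Chat Completions API; only the two
# non-default parameters stated in the post are fixed here.
params = {
    "temperature": 0.16,  # "Default except: Temperature=0.16"
    "max_tokens": 2000,   # "Default except: Max tokens=2000"
}

# from openai import OpenAI
# client = OpenAI()  # requires an API key in the environment
# response = client.chat.completions.create(
#     model="gpt-3.5-turbo-0125",
#     messages=[{"role": "user", "content": prompt}],
#     **params,
# )

print(params["temperature"], params["max_tokens"])
```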