I understand that I can draw zero statistical conclusions about the new GPT-4o model because of this post; I am just here to share my experience. So far, when using GPT-4o (code generation, code review, and statistical reasoning), I have noticed that it gives vague, general (not specific to the prompt), and lazy answers.
Here is a funny example where even the poor GPT-3.5-turbo can provide a better answer.
(I would love to share the conversation link with you, but for some reason you cannot share conversations in comparison mode. The full parameters used are available at the end of the post.)
The prompt
```python
from typing import Any

reroute_to_local = True

class LocallySimulatedRouter(type):
    _local_class = None

    def __init__(cls, name: str, bases: tuple[type, ...], namespace: dict[str, Any]):
        super().__init__(name, bases, namespace)
        is_local = namespace.get("is_local", False)
        if is_local:
            if LocallySimulatedRouter._local_class is not None:
                raise ValueError(f"Got two definitions for locallly simulated class. First={LocallySimulatedRouter._local_class.__name__}, Second={name}")
            LocallySimulatedRouter._local_class = cls

    def __call__(cls, *args, **kwargs):
        if reroute_to_local:
            if LocallySimulatedRouter._local_class is None:
                raise TypeError("Local class is none. Please define a local class when using reroute_to_local")
            return super(LocallySimulatedRouter, cls).__call__(*args, **kwargs)
        return super().__call__(*args, **kwargs)

class Base(metaclass=LocallySimulatedRouter):
    def do(self):
        print("Base - nothing")

class Model(Base):
    def do(self):
        super().do()
        print("Expensive")

class Local(Base):
    is_local = True

    def do(self):
        super().do()
        print("Cheap")

Model().do()
```
What will be the output of Model().do()? Answer shortly
GPT-4o response:
The output will be:

```
Base - nothing
Cheap
```
GPT-3.5-turbo response:
The output of `Model().do()` will be:

```
Base - nothing
Expensive
```
What is the code trying to achieve?
It tries to provide an elegant way of switching to a locally simulated (cheaper, less accurate) model when a global flag is set, by using a metaclass.
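For readers less familiar with metaclasses, the mechanism the code relies on is that instantiating a class invokes its metaclass's `__call__`, which is the hook that lets the router swap the class being constructed. A minimal illustration (names are my own, not from the post):

```python
class Meta(type):
    def __call__(cls, *args, **kwargs):
        # Every instantiation of a class whose metaclass is Meta goes through
        # here before the instance exists, so this is where rerouting can happen.
        print(f"intercepted construction of {cls.__name__}")
        return super().__call__(*args, **kwargs)

class Widget(metaclass=Meta):
    pass

w = Widget()             # prints the interception message
print(type(w).__name__)  # Widget
```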
Why is GPT-4o wrong?
There is a simple mistake in the `LocallySimulatedRouter`'s `__call__` method. If `reroute_to_local=True`, the return value should be

```python
# Returns an instance of the Local class
return super(LocallySimulatedRouter, LocallySimulatedRouter._local_class).__call__(*args, **kwargs)
```

instead of

```python
# Returns an instance of whichever class inheriting from Base was called
return super(LocallySimulatedRouter, cls).__call__(*args, **kwargs)
```
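To make the difference concrete, here is my own condensed sketch of the post's code that runs both return lines side by side: the original line constructs whichever class was called, while the fixed line constructs the registered local class.

```python
class Router(type):
    _local_class = None

    def __init__(cls, name, bases, namespace):
        super().__init__(name, bases, namespace)
        if namespace.get("is_local", False):
            Router._local_class = cls

    def __call__(cls, *args, **kwargs):
        # Buggy line from the post: `cls` is the class that was called
        # (e.g. Model), so no rerouting actually happens.
        buggy = super(Router, cls).__call__(*args, **kwargs)
        # Fixed line: explicitly construct the registered local class.
        fixed = super(Router, Router._local_class).__call__(*args, **kwargs)
        return buggy, fixed

class Base(metaclass=Router):
    pass

class Model(Base):
    pass

class Local(Base):
    is_local = True

buggy, fixed = Model()
print(type(buggy).__name__)  # Model  <- what the code actually builds
print(type(fixed).__name__)  # Local  <- what the author intended
```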
Models’ explanations
I also prompted the models to provide their reasoning, using this prompt:
Explain your reasoning, do not forget about the metaclass implementation.
GPT-4o response (does not realize its mistake)
Here's the reasoning step-by-step:
1. **Metaclass Initialization**:
- The `LocallySimulatedRouter` metaclass is defined with an `__init__` method that checks if a class has the `is_local` attribute set to `True`. If so, it sets `_local_class` to that class.
- The `Local` class has `is_local = True`, so when it is defined, `_local_class` is set to `Local`.
2. **Class Definitions**:
- `Base` is defined with a `do` method that prints "Base - nothing".
- `Model` is defined, inheriting from `Base`, and overrides the `do` method to call `super().do()` and then print "Expensive".
- `Local` is defined, inheriting from `Base`, and overrides the `do` method to call `super().do()` and then print "Cheap". It also has `is_local = True`.
3. **Instance Creation**:
- When `Model()` is instantiated, the `__call__` method of the `LocallySimulatedRouter` metaclass is invoked.
- Since `reroute_to_local` is `True`, the metaclass checks if `_local_class` is set. It is set to `Local`, so an instance of `Local` is created instead of `Model`.
4. **Method Invocation**:
- The `do` method of the `Local` instance is called.
- The `do` method of `Local` calls `super().do()`, which invokes the `do` method of `Base`, printing "Base - nothing".
- After that, the `do` method of `Local` prints "Cheap".
Thus, the output is:

```
Base - nothing
Cheap
```
GPT-3.5-turbo response:
When `Model().do()` is called, the `Model` class is instantiated. Since `Model` is a subclass of `Base`, the `__call__` method of the metaclass `LocallySimulatedRouter` is invoked.
In the metaclass `LocallySimulatedRouter`:
1. The `reroute_to_local` variable is set to `True`, so the code will take the path for rerouting to a local class.
2. The `LocallySimulatedRouter._local_class` is set to the `Local` class because it has the `is_local` attribute set to `True`.
When `Model().do()` is called:
1. The `__call__` method of the metaclass is invoked, and since `reroute_to_local` is `True`, it checks if a local class is defined. In this case, `Local` class is defined and stored in `_local_class`.
2. The `super().do()` call in `Model` class invokes the `do` method of the `Base` class, printing "Base - nothing".
3. Then, the `print("Expensive")` statement in the `Model` class is executed, printing "Expensive".
Therefore, the output of `Model().do()` will be:

```
Base - nothing
Expensive
```
Even though the explanation is certainly not sufficient, it is actually factually correct (when considering the raw statements and not their implications).
I also tested this exact same setup with GPT-4-turbo: it explicitly recognized the intent of the code, told me there was a mistake, provided a fix, and, naturally, produced the correct output.
What are your thoughts on using GPT-4o? Do you share my experience, or does it work better in your use cases?
Motivation for using “Answer shortly”
I almost always use “answer shortly”/“answer concisely” in my prompts; otherwise the models output unnecessarily verbose answers (rewriting my code for no reason even though they change nothing, providing non-valuable summaries/examples/introductions, commenting my code for no reason, etc.), and I don’t want them to waste my precious tokens : ).
I understand that this can hinder the quality of the responses, since it omits the chain of thought. Nonetheless, even when “answer shortly” is omitted in this example, GPT-4o babbles false facts and provides an incorrect answer.
Parameters
Default except:
- Temperature=0.16
- Max tokens=2000
Models used:
- GPT-4o (as of 21st of May 2024)
- GPT-3.5-turbo-0125
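For reproducibility, here is a hedged sketch of where these parameters would go in a call to the OpenAI Chat Completions API. The post does not include the actual script used, so the model/message values below are illustrative, and the live call is commented out since it needs an API key.

```python
# Assumption: the runs used the OpenAI Chat Completions API; only the two
# non-default parameters stated in the post are fixed here.
params = {
    "temperature": 0.16,  # "Default except: Temperature=0.16"
    "max_tokens": 2000,   # "Default except: Max tokens=2000"
}

# from openai import OpenAI
# client = OpenAI()  # requires an API key in the environment
# response = client.chat.completions.create(
#     model="gpt-3.5-turbo-0125",
#     messages=[{"role": "user", "content": prompt}],
#     **params,
# )

print(params["temperature"], params["max_tokens"])
```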