Flip-flopping GPT-4V when used for level 5 autonomous driving

I am building a Custom GPT in the GPT Store to test GPT-4V’s level of common sense as the driver of an autonomous car. The images presented to GPT-4V are mostly uncommon road scenes that it is unlikely to have been trained on. If GPT-4V can come up with an accurate description, a reasonable course of action as a driver, and a good explanation of that action, then we can say it has good road common sense (which is very important for achieving level 5 autonomous driving).

Overall GPT-4V did amazingly well in my tests, but during subsequent dialog with the user, it sometimes flip-flops between two opposing assertions.

For example, for the following image GPT-4V initially asserts that the road curves to the right:

Following is the gist of the subsequent dialog:

User: how certain are you that the road curves to the right, and not left?
GPT-4V: Upon re-examining the image, I see that I made an incorrect observation. The road actually curves to the left, not to the right as I initially stated.
User: I thought the road curves to the right
GPT-4V: You are correct, the road curves to the right…
(and so on, where GPT-4V keeps on changing its assertion)

Is there anything that I can do to remedy this, perhaps with the right type of prompt engineering?

Well, if I take a quick glance at the pic, I’m thinking I have a clear and straight road ahead.

If I squint and don’t expand the image, it kind of looks like the road could be taking a slight left and then a sharp right.

How do you expect the model to answer this if a human can’t do it at a glance? :thinking:


Handedness, i.e. telling left from right, is one of the weaknesses of the vision model; it also has problems with spatial awareness, i.e. the relative relationships between objects.

It is important to remember that this is the worst that AI vision will ever be, and things will only get better, but for now I’d say autonomous driving and navigation tasks are a step too far for the current model.


I see. I did notice the model getting confused about left and right in several other tests. But to clarify, this is not so much about which way the road is actually curving; it is about how indecisive and fickle the model appears, which is unacceptable behavior if I want to build some kind of authoritative expert system out of it. It would be acceptable if I could make the model say “I am not sure”, or stick to one answer until there is further evidence to the contrary.

As for using GPT-4V for autonomous driving: in dozens of test cases I find it surprisingly powerful, with the common sense to give reasonable advice even for very unusual or nuanced road scenes. I actually think anyone serious about building a level 5 driving system should include a multimodal LLM.

This is a case where masking may help immensely. As an idea: have you tried helping the vision model see the road and then asking the same question?


I tried the masking technique with the image and it did not help. But since, as mentioned above, the model may be weak with spatial relationships, I tried it with another image where GPT-4V needs to discover hazards around the house, but it always identifies the animal on the deck as a big dog:

In this case masking worked perfectly, and the animal was correctly identified as a cougar.
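For anyone trying the same thing, the masking described above can be done programmatically with Pillow. This is a minimal sketch, not the poster's actual method; the image and the coordinates of the region to keep are stand-ins, and in practice the box would come from manual inspection or a detector:

```python
from PIL import Image

def mask_image(img: Image.Image, keep_box: tuple) -> Image.Image:
    """Black out everything outside keep_box = (left, top, right, bottom),
    so the vision model focuses only on the region of interest."""
    masked = Image.new("RGB", img.size, (0, 0, 0))
    region = img.crop(keep_box)
    masked.paste(region, (keep_box[0], keep_box[1]))
    return masked

# Stand-in for the real photo; keep only a hypothetical deck area.
photo = Image.new("RGB", (1024, 768), (120, 160, 90))
focused = mask_image(photo, (300, 200, 800, 600))
```

The masked image is then sent to GPT-4V in place of the original, with the same question.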

So the next question is then: is there any way to automate the masking process, so that manual intervention can be avoided?


GPT-4V seems to respond to markers highlighting various things in the image.

So you could have one AI model mark up the image, and then have GPT-4V focus on each marked item.

There is a thread on this over here.
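To sketch how that pipeline could be automated: run any off-the-shelf object detector first, then draw numbered boxes on the image before sending it to GPT-4V, so the prompt can refer to "box 1", "box 2", and so on. The detections below are hard-coded stand-ins for whatever detector you choose; only the annotation step is shown:

```python
from PIL import Image, ImageDraw

def annotate(img: Image.Image, detections: list) -> Image.Image:
    """Draw a numbered red box around each detection (left, top, right, bottom)."""
    draw = ImageDraw.Draw(img)
    for i, box in enumerate(detections, start=1):
        draw.rectangle(box, outline=(255, 0, 0), width=4)
        draw.text((box[0] + 5, box[1] + 5), str(i), fill=(255, 0, 0))
    return img

# Stand-in image and detections; in practice these come from a detector model.
image = Image.new("RGB", (640, 480), (200, 200, 200))
annotated = annotate(image, [(50, 60, 200, 220), (300, 100, 500, 400)])
```

The annotated image then goes to GPT-4V with a prompt like "describe the object in box 1, then box 2", which avoids any manual masking step.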


I’m sorry, but is that a cougar on your back porch?

Even I as a human had to be like “Aww look at that cute anim-OH”


This is just one of the images I collected to test whether GPT-4V can respond with commonsense responses when acting as a kind of Guardian Bot around the house, identifying any potential hazard to the family, however unusual the situation is.


Well, you have definitely found a quality data set, I’ll tell you that much!


@curt.kennedy Fascinating! Thanks for the pointer!


To answer my own question about GPT-4V misidentifying the cougar in the image as a dog: I have found that putting an image of the cougar in a knowledge file fixes the problem.

Also, in the context of this “GuardianBot”, I have another case where GPT-4V insists on calling 911 when it sees a lion in the house, even if there is a calm and cheerful person nearby, and even when subsequent dialog with the user confirmed that it is a pet. This behavior can also be overridden in the knowledge file.

So overall it seems possible to override both GPT-4V’s image misidentifications and its behavior from the knowledge file, which is great.

One very simple workaround I’ve found, almost too simple, is to insert a blue band on the left side of the image and another band on the right. I just concatenate them on each side and tell GPT-4V in the prompt that these bands represent left and right respectively. I’ve found that consistency has really improved a lot for our use case (scene description for blind people): 0 wrong estimations in the last 20 tests I ran, versus about a 30-40% error rate before.
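If anyone wants to reproduce this, the band trick is plain image concatenation. A sketch with Pillow, not the poster's exact code; the band width and the colors (blue for left, red for right) are my own assumptions, adapt to taste:

```python
from PIL import Image

def add_side_bands(img: Image.Image, band_width: int = 40,
                   left_color=(0, 0, 255), right_color=(255, 0, 0)) -> Image.Image:
    """Concatenate a colored band on each side so the model can anchor left/right."""
    out = Image.new("RGB", (img.width + 2 * band_width, img.height))
    out.paste(Image.new("RGB", (band_width, img.height), left_color), (0, 0))
    out.paste(img, (band_width, 0))
    out.paste(Image.new("RGB", (band_width, img.height), right_color),
              (img.width + band_width, 0))
    return out

# Stand-in for the real scene photo.
scene = Image.new("RGB", (800, 600), (90, 90, 90))
banded = add_side_bands(scene)
```

The prompt then states something like: "The blue band marks the LEFT edge of the scene, the red band marks the RIGHT edge."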
