I am building a Custom GPT in the GPT Store to test GPT-4V’s level of common sense as the driver of an autonomous car. That is, the images presented to GPT-4V are mostly uncommon road scenes that GPT-4V is unlikely to have been trained on, so if it can come up with an accurate description, a reasonable course of action as a driver, and a good explanation of that action, then we can say that GPT-4V has good road common sense (which is very important for achieving level 5 autonomous driving).
Overall, GPT-4V did amazingly well in my tests, but during subsequent dialog with the user it sometimes flip-flops between two opposing assertions.
For example, for the following image GPT-4V initially asserts that the road curves to the right:
User: How certain are you that the road curves to the right, and not left?
GPT-4V: Upon re-examining the image, I see that I made an incorrect observation. The road actually curves to the left, not to the right as I initially stated.
User: I thought the road curves to the right.
GPT-4V: You are correct, the road curves to the right…
(and so on, where GPT-4V keeps on changing its assertion)
Is there anything that I can do to remedy this, perhaps with the right type of prompt engineering?
Handedness, i.e. left vs. right, is one of the weaknesses of the vision model; it also has problems with spatial awareness and the relative relationships between objects.
It is important to remember that this is the worst that AI vision will ever be, and things will only get better, but for now I’d say autonomous driving and navigation tasks are a step too far for the current model.
I see. I did notice the model getting confused about left and right in several other tests. But to clarify, this is not so much about which way the road is actually curving; it is about how indecisive and fickle the model appears to be, which is unacceptable behavior if I want to build some kind of authoritative expert system out of it. It would be acceptable if I could make the model say “I am not sure”, or stick to one answer until there is further evidence to the contrary (a rough sketch of how I might phrase that as an instruction is at the end of this post).
As to using GPT-4V for autonomous driving, in dozens of test cases I have found it surprisingly powerful, with the common sense to give reasonable advice even for very unusual or nuanced road scenes. I actually think anyone serious about building a level 5 driving system should include a multimodal LLM.
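For reference, if I were calling the model through the API rather than only through a Custom GPT, the kind of instruction I have in mind might look something like the sketch below. The model name, the wording of the instruction, and the image URL are just placeholders, and I have not verified that this fully cures the flip-flopping:

```python
from openai import OpenAI

client = OpenAI()

# Placeholder system instruction: commit to an answer or say "I am not sure",
# instead of reversing the answer whenever the user pushes back.
SYSTEM_PROMPT = (
    "You are the vision module of a driving assistant. "
    "If you are not confident about a spatial detail such as left vs. right, "
    "say 'I am not sure' rather than guessing. "
    "Once you have stated an observation, do not reverse it just because the "
    "user disagrees; only change it if you can point to new visual evidence, "
    "and explain that evidence."
)

response = client.chat.completions.create(
    model="gpt-4o",  # any vision-capable model
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Which way does the road curve?"},
                {"type": "image_url",
                 "image_url": {"url": "https://example.com/road.jpg"}},  # placeholder image
            ],
        },
    ],
)
print(response.choices[0].message.content)
```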
I tried the masking technique with the image and it did not help. But since, as mentioned above, the model may be weak with spatial relationships, I tried it with another image where GPT-4V needs to discover hazards around the house; it always identifies the animal on the deck as a big dog:
This is just one of the images I collected to test whether GPT-4V can give commonsense responses when acting as a kind of Guardian Bot around the house, identifying any potential hazard to the family, however unusual the situation.
To answer my own question about how to deal with GPT-4V misidentifying the cougar in the image as a dog: I have found that putting an image of the cougar in a knowledge file fixes the problem.
Also, in the context of this “GuardianBot”, I have another case where GPT-4V insists on calling 911 when it sees a lion in the house, even if there is a calm and cheerful person nearby and even when subsequent dialog with the user confirms that it is a pet. This behavior can also be overridden in the knowledge file.
So overall it seems possible to override both GPT-4V’s image misidentifications and its behavior from the knowledge file, which is great.
One very simple workaround I’ve found, almost too simple, is to insert a blue band on the left and a red band on the right of the image. I just concatenate them on each side and tell GPT-4V in the prompt that these bands represent left and right respectively. Consistency has really improved a lot for our use case (scene description for blind people): 0 wrong estimations in the last 20 tests I ran, versus about a 30–40% error rate before.
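In case it helps anyone reproduce this, here is a minimal sketch of the concatenation step using Pillow; the band width, exact colours, and file names are just my illustrative choices:

```python
from PIL import Image

def add_side_bands(path, band_width=40):
    """Return the image with a blue band glued to its left edge and a red band to its right edge."""
    img = Image.open(path).convert("RGB")
    w, h = img.size
    out = Image.new("RGB", (w + 2 * band_width, h))
    out.paste(Image.new("RGB", (band_width, h), "blue"), (0, 0))              # left marker
    out.paste(img, (band_width, 0))                                           # original image
    out.paste(Image.new("RGB", (band_width, h), "red"), (w + band_width, 0))  # right marker
    return out

add_side_bands("road.jpg").save("road_banded.jpg")
```

The prompt then just needs a line along the lines of: “The blue band marks the left edge of the scene and the red band marks the right edge.”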