How to patch up GPT-4V for image interpretation/reasoning applications

I am looking to use GPT-4V for a task that involves understanding and reasoning over a given image. For example, given an image of a recycling calendar like this:

A user may ask, “Can I recycle cardboard boxes tomorrow?”

So far I have tried it out in the GPT Builder, both under GPT-4 and under a custom GPT built on GPT-4. The two gave different results, but both were flawed.

My question is: what can I do to “patch up” such errors so that the application behaves as expected? Aside from waiting for GPT-4V to become available for fine-tuning, are there other, more lightweight methods (e.g., in the style of RAG) available through the API?

Hey there and welcome to the community!

So, GPT-4V is not currently equipped to handle this much on its own. At best, GPT-4V is an image “classifier”. There is currently no way to patch or fine-tune the model either.

The best approach would be to create such a schedule programmatically (alongside the API) instead of using vision, and then feed that textual data to the model.
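
A minimal sketch of that idea, assuming the collection dates have been typed in by hand or pulled from the municipality's own data: the schedule lives in plain Python, the date arithmetic happens in code, and only a short text summary is handed to the model. The dates, category names, and function names below are made up for illustration and are not read from the calendar in the question.

```python
from datetime import date

# Hypothetical schedule; real dates would come from the calendar's source data.
SCHEDULE = {
    "trash": [date(2024, 2, 1), date(2024, 2, 8)],
    "paper recycling": [date(2024, 2, 8), date(2024, 2, 22)],
    "commingled recycling": [date(2024, 2, 15), date(2024, 2, 29)],
}

def next_collection(category, today):
    """Return the first collection date on or after `today`, or None."""
    upcoming = [d for d in sorted(SCHEDULE[category]) if d >= today]
    return upcoming[0] if upcoming else None

def schedule_as_text(today):
    """Render the upcoming collections as plain text for the model's context."""
    lines = [f"Today is {today.isoformat()}."]
    for category in SCHEDULE:
        nxt = next_collection(category, today)
        when = nxt.isoformat() if nxt else "no date on file"
        lines.append(f"Next {category} collection: {when}.")
    return "\n".join(lines)

print(schedule_as_text(date(2024, 1, 30)))
```

The resulting text can go into the system or user message, so the model only has to phrase an answer rather than read 12 months of symbols off an image.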


The model is borked, broken. It’s emitting Python even when told to use GPT-4’s vision skills. The same symptom has been seen elsewhere: useless Python code or other tool calls emitted despite any instructions.

And then, despite the AI saying it will “analyze by itself”, more garbage code.


So we take this over to the API, where we have an explicit vision model instead of whatever is going on with ChatGPT…
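
For anyone who wants to reproduce this, here is a rough sketch of how a request of this kind might be assembled in the Chat Completions message format, with the image sent inline as a base64 data URL. The helper name `build_vision_messages` is hypothetical, the image bytes are a placeholder, and the prompts are shortened. The actual network call is left as a comment, since the model name and client setup depend on your account.

```python
import base64

def build_vision_messages(system_prompt, user_prompt, image_bytes, mime="image/jpeg"):
    """Build a Chat Completions style message list with one inline image."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return [
        {"role": "system", "content": system_prompt},
        {
            "role": "user",
            "content": [
                {"type": "text", "text": user_prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:{mime};base64,{b64}"}},
            ],
        },
    ]

messages = build_vision_messages(
    "You are ImageMaster, an AI assistant powered by GPT-4 with computer vision.",
    "Today's date is January 30. What is the next trash day on the calendar?",
    b"\xff\xd8\xff",  # placeholder bytes; a real call would read the image file
)

# With the official `openai` package (not imported here), the list would be
# passed along the lines of:
#   client.chat.completions.create(model=<vision model>, messages=messages,
#                                  max_tokens=<limit>)
print(messages[1]["content"][1]["image_url"]["url"][:30])
```

Any image resizing would have to happen before encoding, in your own code.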


"role": "system",
You are ImageMaster, an AI assistant powered by GPT-4 with computer vision.
- extract text from image
- describe images
- analyze image contents for user
"role": "user",
    """You will use gpt-4-vision skill to analyze the attached image. Today's date is January 30. From the calendar, look at the months, and analyze the symbols within that represent trash days and recycling days. What is the next day that each will be collected?""".strip(),
  "image": base64_image,
  "resize": 2048,

The image shows a calendar for the year 2024 with various symbols indicating trash and recycling collection days. The symbols are as follows:

  • A black trash can symbol represents regular trash days.
  • A yellow recycling symbol represents paper recycling days.
  • A blue recycling symbol represents commingled recycling days.

Today’s date is January 30, 2024. Looking at the calendar for January and February:

  • The next regular trash collection day (black trash can symbol) after January 30 is on January 31.
  • The next paper recycling day (yellow recycling symbol) after January 30 is on February 6.
  • The next commingled recycling day (blue recycling symbol) after January 30 is on February 13.

Please note that the calendar also indicates that if a holiday falls on a collection day, the collection will be the next day. However, there are no holidays indicated between January 30 and the next collection days for each type of waste.
{'prompt_tokens': 189, 'completion_tokens': 195, 'total_tokens': 384}
6.2746076583862305 seconds

You can see that the AI misidentified a whole bunch in this image. 365 numbers plus a bunch of instructions and symbols is more than the AI can keep straight. There is no ‘patch’, except to divide the job into very small tasks, such as one month at a time, tell the model which symbols to identify, and try again.
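
One way the "small tasks" idea could look in practice, as a sketch only: compute a crop box per month and ask a narrow question about each crop. The 3 x 4 month grid and the 2048 x 2048 image size are assumptions about the calendar layout, and the symbol descriptions are copied from the model's own reply above, so they may themselves be wrong. The function names are hypothetical.

```python
import calendar

def month_boxes(width, height, cols=3, rows=4):
    """Return {month_number: (left, top, right, bottom)} for a cols x rows grid."""
    cell_w, cell_h = width // cols, height // rows
    boxes = {}
    for i in range(cols * rows):
        row, col = divmod(i, cols)
        boxes[i + 1] = (col * cell_w, row * cell_h,
                        (col + 1) * cell_w, (row + 1) * cell_h)
    return boxes

def month_prompt(month, year=2024):
    """Build a narrow prompt that asks about a single month and named symbols."""
    name = calendar.month_name[month]
    days = calendar.monthrange(year, month)[1]
    return (
        f"This image shows only {name} {year}, which has {days} days. "
        "List the day numbers marked with the black trash can symbol, "
        "then the yellow recycling symbol, then the blue recycling symbol. "
        "Answer with three comma-separated lists and nothing else."
    )

boxes = month_boxes(2048, 2048)
print(boxes[2])         # crop box for February under the assumed grid
print(month_prompt(2))
```

Each box is in (left, top, right, bottom) order, which is the form an image library such as Pillow takes for cropping. The per-month answers can then be parsed into dates, and the "what is the next collection day" step done in ordinary code rather than by the model.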


Thank you to all who replied.

To the OpenAI team: I hope that someday you guys will be able to provide some methods for ‘patching up’ the behavior of a vision model such as GPT-4V. Model fine-tuning is one way, although it is a rather heavy approach. If there is a solution through the API that overrides the model's behavior, that would be even better.

Why ask for such features? Because for any mission-critical application, even if the model misbehaves only 0.1% of the time, I may still have to find ways to patch it up in order to avoid disaster.

For example, I have a use case where GPT-4V is used as a vision component in autonomous driving, for road scene analysis and for offering commonsense recommendations on what to do in uncommon situations. GPT-4V does really well at this task, but I do need a way to tweak it when it fails in certain situations, since the consequences could be disastrous.