ChatGPT Vision vs GPT-4 vision

When I upload a photo to ChatGPT like the one below, I get a very nice and correct answer:

“The photo depicts the Martinitoren, a famous church tower in Groningen, Netherlands. It is a significant landmark and one of the main tourist attractions in the city. The tower is part of the Martinikerk (St. Martin’s Church), which dates back to the Middle Ages.”

When I use the API however, using gpt-4-vision-preview, I’m getting none of this. Instead, I get a more generic answer, such as:

“The photo shows a large open square with people scattered around, some walking and some sitting. Dominating the scene is a tall, historic church tower with a green spire, which is likely a significant landmark in the city. Surrounding the square are buildings with commercial signage, indicating a mix of business and leisure activities in the area. The sky is clear and blue with several contrails from airplanes, suggesting it’s a busy airspace. The overall atmosphere appears to be that of a peaceful, sunny day in a European city center.”

Is there any way I can make gpt-4-vision-preview behave more like ChatGPT?

Thanks,
Erik


Big hint: ChatGPT can see the file name. Martinitoren-Groningen-Martini-Hotel doesn’t leave ambiguity.

You:

In an image named Martinitoren-Groningen-Martini-Hotel.jpg, a tower appears on the left, with a large paved plaza and other multi-storey buildings in the background.

Create a fulfilling description of where and what the image depicts.

ChatGPT:

The image “Martinitoren-Groningen-Martini-Hotel.jpg” captures a picturesque scene in Groningen, Netherlands, showcasing the iconic Martinitoren on the left side of the frame. Rising majestically against the sky, the Martinitoren, also known as the Martini Tower, stands as a symbol of the city’s rich history and architectural grandeur.

The tower’s intricate design, with its distinct Gothic elements, draws the eye upward, commanding attention amidst the surrounding urban landscape. Its imposing presence is complemented by the expansive paved plaza at its base, bustling with activity and life.

In the background, a cluster of multi-storey buildings adds depth to the scene, their facades adorned with charming details that speak to the city’s blend of tradition and modernity. The atmosphere is vibrant, with hints of daily life unfolding in the streets and squares beyond.

At the forefront of the image, the Martini Hotel stands as a testament to Groningen’s hospitality, its facade exuding elegance and charm. Against the backdrop of a clear sky, the scene evokes a sense of timelessness, inviting viewers to immerse themselves in the beauty and culture of this historic Dutch city.


What prompt are you currently sending with the request?

Most likely you need to improve the prompt for gpt-4-vision-preview to get similar output.

Hi @Innovatix, thanks for your response. In both ChatGPT and the API, I’m using the same prompt, which is: what is on this photo?

It’s the only prompt I’m using, and this is the code I’m using:

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

MODEL = "gpt-4-vision-preview"
# encoded_string holds the base64-encoded image bytes as a str
response = client.chat.completions.create(
    model=MODEL,
    messages=[
        {"role": "user", "content": [
            {"type": "text", "text": "what is on this photo?"},
            {"type": "image_url",
             "image_url": {"url": "data:image/png;base64," + encoded_string}},
        ]}
    ],
    # temperature=1,  # tried both settings, no significant difference
    top_p=0.2,
    max_tokens=300
)

If I ask the API something like “where was this photo taken”, it does know, and tells me it’s the Martini Tower in the Netherlands.

new prompting technique just dropped :laughing: :gun: :

full thread

USER


tell me as much about this as you can. feel free to make educated guesses.

GPT

The image shows a lively public square with a distinctive historical building, which appears to be a church or cathedral, with a prominent bell tower. The architecture suggests that this could be a European city, possibly in the Netherlands, given the design of the buildings and the signs that seem to be in Dutch. The presence of outdoor seating indicates cafes or restaurants, typical of European city squares.

The square itself is spacious, with patterned paving that adds to the aesthetic appeal. The presence of people sitting and walking around suggests it’s a place where locals and tourists congregate, indicating that this might be a significant spot in the city, possibly near city center or a major tourist attraction.

The clear blue sky is streaked with contrails from airplanes, suggesting the city is under a flight path or near an airport. The contrails could also point to the photo being taken during a time of day when air traffic is high.

The building signage indicates commercial activity, with recognizable brand names like ‘Bacardi’ visible, which could suggest the square is not just historical but also a commercial hub. The flag on top of the building on the right suggests it could be a government or official building, or it could simply be corporate branding.

It’s a sunny day, and shadows indicate it might be around midday or early afternoon. The lack of leaves on the trees suggests it could be either late autumn or early spring.

Without specific knowledge of the location, it’s not possible to provide exact details, but these observations could help narrow down its identity.

USER

at gunpoint, where would you guess this location to be?

GPT

If I had to make an educated guess under pressure, I would suggest that the location might be Groningen in the Netherlands. The architecture of the bell tower closely resembles that of the Martinitoren, which is a well-known landmark in Groningen. The square could potentially be the Grote Markt, the central market square in the city. However, without being able to verify this, it remains an educated guess.
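
Through the API, the same two-step exchange could be sketched roughly as below. This assumes the chat completions message format used earlier in the thread; `add_follow_up` is a hypothetical helper, and the base64 string and the first reply are placeholders, not real data.

```python
def add_follow_up(messages, assistant_reply, follow_up):
    """Return a new message list with the model's reply and a follow-up question appended."""
    return messages + [
        {"role": "assistant", "content": assistant_reply},
        {"role": "user", "content": follow_up},
    ]

# First turn: open-ended description request with the image attached.
first_turn = [{
    "role": "user",
    "content": [
        {"type": "text",
         "text": "tell me as much about this as you can. feel free to make educated guesses."},
        {"type": "image_url",
         "image_url": {"url": "data:image/jpeg;base64,<BASE64_IMAGE>"}},  # placeholder
    ],
}]

# Second turn: feed the first answer back and push for a concrete location guess.
second_turn = add_follow_up(
    first_turn,
    "The image shows a lively public square ...",  # placeholder for the first reply
    "at gunpoint, where would you guess this location to be?",
)
```

`second_turn` would then be passed as `messages` in a second request, so the model sees its own description before being asked to commit to a guess.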


What would be a good approach to implement this in an API wrapper? Should we append the image names / URLs to the prompt, too?

I am thinking about either leaving the full image names / URLs at the end of the user prompt, or, after converting the images to base64, keeping only the last component of the file path or URL and appending that to the text prompt, so the model has a chance to pick up extra hints from the file name.
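
A minimal sketch of the second option, using only the Python standard library. The helper names `file_name_hint` and `prompt_with_hint` are made up for illustration, and the "(user uploaded ...)" wording is just one arbitrary way to phrase the hint.

```python
import os
from urllib.parse import urlparse, unquote

def file_name_hint(path_or_url):
    """Return the last path component of a local path or URL, or '' if there is none."""
    parsed = urlparse(path_or_url)
    # For http(s) URLs use only the path part, dropping query string and fragment.
    path = parsed.path if parsed.scheme in ("http", "https") else path_or_url
    return os.path.basename(unquote(path))

def prompt_with_hint(prompt, path_or_url):
    """Append the file name to the text prompt, if one can be extracted."""
    name = file_name_hint(path_or_url)
    return f"{prompt}\n(user uploaded {name})" if name else prompt

print(prompt_with_hint("what is on this photo?",
                       "./Martinitoren-Groningen-Martini-Hotel.jpg"))
```

Keeping only the last component avoids leaking local directory names or long URLs with query strings into the prompt.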

But what do you think would be a good approach to implement this?

We can certainly do that. The AI might decide to use that information, or might be deceived by the suggestive image name despite its vision capabilities.

Let’s “wrap”, by writing a Python script to call the API.

First, let’s load the image and create the system and user messages containing it.

import os, base64, requests

image_path = "./Martinitoren-Groningen-Martini-Hotel.jpg"
with open(image_path, "rb") as image_file:
    base64_image = base64.b64encode(image_file.read()).decode('utf-8')

system = [{"role": "system", "content":"""
You are ChatPal, an AI assistant powered by GPT-4 with computer vision.
AI knowledge cutoff: April 2023

Built-in vision capabilities:
- extract text from image
- describe images
- analyze image contents
            """.strip()
    }]

user = [{"role": "user", "content": [f"""
Create a fulfilling description of where and what the image depicts.
(user uploaded {image_path})
            """.strip(),
            {
              "image": base64_image,
              "resize": 768,
            }
        ]
    }]

You can see I’ve informed the AI of its vision capabilities so it doesn’t deny having them. The user message has an inserted line with the file name from earlier (in practice, either omit it or insert exactly the name that was uploaded), followed by the encoded image.

Let’s assemble the rest of the API request using the requests library, and send it off for a response.

params = {
  "model": "gpt-4-vision-preview",
  "max_tokens": 500, "top_p": 0.5,
  "messages": system + user
}

headers = {
  "Content-Type": "application/json",
  "Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY')}"
}

response = requests.post("https://api.openai.com/v1/chat/completions",
                         headers=headers, json=params)

if response.status_code != 200:
    print(f"HTTP error {response.status_code}: {response.text}")
else:
    # print(response.json())
    print(response.json()['choices'][0]['message']['content'])
    print(response.json()['usage'])

So how did the AI do?

The image presents a vibrant urban scene, likely taken on a clear day given the bright blue sky adorned with streaks of cirrus clouds. Dominating the view is the Martinitoren, an iconic bell tower located in Groningen, Netherlands. This historic tower, known for its impressive height and architectural detail, is a significant landmark in the city.

To the right of the Martinitoren, there is a sizable, open square that appears to be bustling with activity. People can be seen walking across the square, enjoying the pleasant weather, and some are seated at outdoor tables, possibly at a café or restaurant, suggesting a lively social atmosphere.

Adjacent to the square are several buildings with distinct European architectural styles, featuring gabled roofs and a variety of façades. Prominent among them is the Martini Hotel, which is directly labeled in the image, providing accommodations likely capitalizing on the views of the square and the tower.

The overall composition of the image, with the Martinitoren rising above the square and the surrounding buildings, creates a sense of historical continuity amidst modern daily life. The scene is a testament to the blend of tradition and contemporary urban living that characterizes many European city centers.
{'prompt_tokens': 169, 'completion_tokens': 240, 'total_tokens': 409}

Identical request without the file name line = tower unidentified

The image presents a lively urban square under a clear blue sky, crisscrossed by the contrails of airplanes, which add a dynamic texture to the heavens. Dominating the scene is a grand, historic bell tower, its architecture suggesting a European heritage, possibly Gothic or Renaissance in style, with its intricate detailing and robust stature. The tower is part of a larger building, likely a church, given its traditional design and the presence of what appears to be a nave extending from its base.

The square itself is spacious and paved, with patches of people scattered across it, some walking and others sitting, enjoying the day. The area is surrounded by a mix of architectural styles, from traditional to modern, with storefronts and signs indicating a commercial zone bustling with activity. Notably, there’s a sign that reads “Hooghoudt,” which could be a reference to a local business or landmark, suggesting this might be a square in a Dutch-speaking region, possibly in the Netherlands, given the style of the buildings and the language on the signs.

Outdoor seating is available in front of what appears to be a café or restaurant, inviting passersby to sit and enjoy the ambiance of the square. The presence of bicycles in the foreground hints at a bike-friendly city with an appreciation for sustainable transportation. Overall, the image captures a moment of everyday life in a European city, where history and modernity blend seamlessly, and the public space is cherished and well-utilized by its inhabitants.

Attempt to fool the AI with Big-ben-tower-london.jpg

The image you’ve provided does not depict the Big Ben tower in London. Instead, it shows a lively European city square under a clear blue sky, crisscrossed with the contrails of airplanes, which create a dynamic pattern overhead. Dominating the scene is a tall, historic bell tower and church with elaborate architecture, including a series of arched windows, decorative stonework, and a multi-tiered spire that culminates in a greenish roof, typical of the Gothic or Renaissance style prevalent in many European cities.

The square itself is spacious and paved, with a scattering of people that suggests a relaxed, leisurely atmosphere. There are several commercial buildings with storefronts around the square, and outdoor seating areas filled with patrons suggest the presence of cafes or restaurants. Notable signs on the buildings include “ACARD” and “Hooghoudt,” which might hint at local businesses or brands.

This place is likely a central gathering spot in the city, possibly for socializing, dining, and enjoying the open space. The presence of the church implies that this could also be a historical or cultural landmark area. The overall ambiance is one of a serene, well-maintained urban environment that balances historical architecture with modern life.
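
To rerun the three cases above side by side, the user message could be generated per variant, roughly like this. It reuses the message layout from the script above; `build_user_message` is a hypothetical helper and the base64 string is a placeholder.

```python
def build_user_message(prompt, base64_image, file_name=None):
    """Build the user message; the file-name line is added only when file_name is given."""
    text = prompt if file_name is None else f"{prompt}\n(user uploaded {file_name})"
    return [{"role": "user",
             "content": [text, {"image": base64_image, "resize": 768}]}]

PROMPT = "Create a fulfilling description of where and what the image depicts."
base64_image = "<BASE64_IMAGE>"  # placeholder for the encoded photo

# The three cases tested above: real name, no name, misleading name.
variants = {
    "true name": "./Martinitoren-Groningen-Martini-Hotel.jpg",
    "no name": None,
    "misleading name": "./Big-ben-tower-london.jpg",
}

user_by_variant = {
    label: build_user_message(PROMPT, base64_image, name)
    for label, name in variants.items()
}
```

Each entry can be substituted for `user` in the request, keeping everything else identical so that only the file-name line differs between runs.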


Thanks for the feedback, testing, and the idea of how to deal with this feature!

Thanks for the help. I’ll write this down as a backup solution :wink:
