How do I do few-shot prompting that interleaves text and images with gpt-4-vision-preview, as seen in "The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)"?

I don’t understand how to interleave text and images (or just control their order in the prompt) when using the API, especially for few-shot prompting with image and text pairs.

I see that multiple images can be uploaded, but I can’t find a way to control where the text falls relative to each image, as is done in “The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)”.

Thanks for the help!

For example, looking at this as reference

Perhaps you can format the message like this:

messages=[
    {
      "role": "user",
      "content": [
        {"type": "text", "text": "In the graph, which year has the highest average gas price for the month of June?"},
        {
          "type": "image_url",
          "image_url": {
            "url": "https://.../national_gas_price_comparison_2016-2019.jpg",
          },
        },
        {"type": "text", "text": "This graph is a line plot for national gas price comparison from 2016 until 02/04/2019. The legend on top shows the line color of each year, red (2019), blue (2018), green (2017)  and orange (2016). Since the data is reported until Feb. 2019, only 3 years have datapoints for the month of June, 2018 (blue), 2017 (green) and 2016 (orange). Among them, blue line for 2018 is at the top for the month of June.  Hence, the year with the highest average gas price for the month of June is 2018. "},
        {
          "type": "image_url",
          "image_url": {
            "url": "https://.../national_gas_price_comparison_2015-2018.jpg",
          },
        },
        {"type": "text", "text": "This graph is a line plot for national gas price comparison from 2015 until 12/10/2018. The legend on top shows the line color of each year, red (2018), orange (2017), green (2016)  and orange (2017). Since the data is reported until Dec. 2018, all 4 years have datapoints for the month of June. Among them, red line for 2018 is at the top for the month of June.  Hence, the year with the highest average gas price for the month of June is 2018. "},
        ...
      ],
    }
  ],

Previously, I thought the order of images and text within a single message's content list did not matter, but it does seem to be relevant.
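
If the ordering does matter, it may be easier to build the content list programmatically than to write it by hand. Here is a minimal sketch of that idea. The helper name `build_few_shot_content` and the `example.com` URLs are hypothetical placeholders of mine; only the `text` and `image_url` part shapes come from the message format above.

```python
from typing import Dict, List, Tuple


def build_few_shot_content(
    question: str,
    examples: List[Tuple[str, str]],
    query_image_url: str,
) -> List[Dict]:
    """Interleave text and image parts in the order they should be read.

    `examples` is a list of (image_url, worked_answer) pairs. Each image is
    immediately followed by its answer, and the unanswered query image goes last.
    """
    content: List[Dict] = [{"type": "text", "text": question}]
    for image_url, answer in examples:
        content.append({"type": "image_url", "image_url": {"url": image_url}})
        content.append({"type": "text", "text": answer})
    # The final image has no answer after it; that is the one being asked about.
    content.append({"type": "image_url", "image_url": {"url": query_image_url}})
    return content


# Hypothetical usage with placeholder URLs and answers.
messages = [
    {
        "role": "user",
        "content": build_few_shot_content(
            "In the graph, which year has the highest average gas price for the month of June?",
            [
                ("https://example.com/shot1.jpg", "Worked answer for the first graph."),
                ("https://example.com/shot2.jpg", "Worked answer for the second graph."),
            ],
            "https://example.com/query.jpg",
        ),
    }
]
```

The resulting `messages` list can be passed as-is wherever the hand-written one above would go.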

import os
import requests
import base64

# Configuration

GPT4V_KEY = ""

headers = {
    "Content-Type": "application/json",
    "api-key": GPT4V_KEY,
}

# Function to encode image to base64

def encode_image_to_base64(image_path):
    with open(image_path, "rb") as img_file:
        encoded_image = base64.b64encode(img_file.read()).decode("utf-8")
    return encoded_image

# Payload for the request

payload = {
    "model": "gpt-4-vision-preview",
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Where is dove light hydration lotion present on the shelf?"
                },
                {
                    "type": "image",
                    "image": {
                        "base64": encode_image_to_base64(r"D:\STORE\Retail\20240314_151622.jpg"),
                    }
                },
                {
                    "type": "text",
                    "text": "It is located on the second shelf in the middle."
                },
                {
                    "type": "image",
                    "image": {
                        "base64": encode_image_to_base64(r"D:\STORE\Retail\20240314_151409.jpg"),
                    }
                },
                {
                    "type": "text",
                    "text": "It is located on the second shelf at the right."
                },
                {
                    "type": "text",
                    "text": "Where is dove light hydration lotion present on the shelf?"
                },
                {
                    "type": "image",
                    "image": {
                        "base64": encode_image_to_base64(r"D:\STORE\Retail\20240314_151535.jpg"),
                    }
                },
            ]
        }
    ],
    "max_tokens": 300
}

GPT4V_ENDPOINT = ""

# Send request

try:
    response = requests.post(GPT4V_ENDPOINT, headers=headers, json=payload)
    response.raise_for_status()  # Will raise an HTTPError if the HTTP request returned an unsuccessful status code
except requests.RequestException as e:
    raise SystemExit(f"Failed to make the request. Error: {e}")

# Extracting and printing GPT-4's response

response_data = response.json()
if 'choices' in response_data and response_data['choices']:
    gpt4_response = response_data['choices'][0]['message']['content']
    print(gpt4_response)
else:
    print("No GPT-4 response found in the API response.")

What's wrong with this code? I tried to implement few-shot learning, but it doesn't generate a response.
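
One likely culprit, as far as I can tell: the image entries use `"type": "image"` with an `"image": {"base64": ...}` field, whereas the vision message format shown earlier in this thread uses `"type": "image_url"`. For local files, the base64 string is normally passed as a data URL inside `image_url.url`. Below is a minimal sketch of that conversion. The helper names `to_data_url` and `image_part` are my own, and the JPEG MIME type is an assumption based on the `.jpg` file names.

```python
import base64


def to_data_url(image_bytes: bytes, mime_type: str = "image/jpeg") -> str:
    """Encode raw image bytes as a base64 data URL."""
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    return f"data:{mime_type};base64,{encoded}"


def image_part(image_bytes: bytes, mime_type: str = "image/jpeg") -> dict:
    """Build one `image_url` content entry from raw image bytes."""
    return {
        "type": "image_url",
        "image_url": {"url": to_data_url(image_bytes, mime_type)},
    }


# Stand-in bytes (the JPEG magic number) so the sketch runs without any files.
example_part = image_part(b"\xff\xd8\xff")
```

In the payload, each `{"type": "image", "image": {...}}` entry would then become `image_part(open(path, "rb").read())`, with the text entries left where they are. It is also worth checking that `GPT4V_KEY` and `GPT4V_ENDPOINT` are actually filled in, since the request cannot succeed while they are empty strings.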

I'd like to resurrect this thread: is few-shot prompting a vision model any different from few-shot prompting a language model? How were the results? Did it actually work?

The results are not fully accurate, but the responses are better than those generated without few-shot examples. So I’d say it works.
