Failed example of few-shot learning using ChatGPT-4 Vision on the web app

Hey guys. I’ve been testing ChatGPT-4V’s capabilities and trying to replicate examples from “The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)” to learn how to use it.

I tried to follow the in-context few-shot learning example for the speedometer reading. However, the result was not accurate using two shots. I could try more shots, but I’m wondering whether it would be better to use the OpenAI API rather than ChatGPT-4 on the web.

I reviewed the paper. They demonstrate few-shot learning. You don’t, and can’t, in ChatGPT. The closest you can get within the chatbot is to give it tasks it can successfully complete one after another, slowly increasing the difficulty. All of the paper’s contributors are within Microsoft, which also gives them enhanced access to models; they are even acknowledged in the GPT-4 Vision attributions.

Few-shot means demonstrating multiple inputs and responses that simulate actual interactions, as if they had already happened.

User: What is the capital of Australia?
Assistant: Canberra.
User: Who painted the Mona Lisa?
Assistant: Leonardo da Vinci.
User: What is the chemical symbol for gold?
Assistant: Au.
User: Who discovered penicillin?
Assistant: Alexander Fleming.
User: Hi, do you know who was on the second moon landing?

In a completion AI, this would demonstrate the fact and form of a question-answering session that already happened, and even train the model that answers are correct and brief.
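As a rough sketch of what that looks like against a completion-style endpoint (the model name, stop sequence, and final question are just illustrative choices here):

```python
from openai import OpenAI

client = OpenAI()

# The few-shot transcript is passed as one raw prompt string; the model
# continues the pattern, so it answers in the same brief, correct style.
prompt = """User: What is the capital of Australia?
Assistant: Canberra.
User: Who painted the Mona Lisa?
Assistant: Leonardo da Vinci.
User: What is the chemical symbol for gold?
Assistant: Au.
User: Who discovered penicillin?
Assistant: Alexander Fleming.
User: Hi, do you know who was on the second moon landing?
Assistant:"""

response = client.completions.create(
    model="gpt-3.5-turbo-instruct",  # a completion model, used only for illustration
    prompt=prompt,
    max_tokens=30,
    stop=["User:"],  # stop before the model invents the next question
)
print(response.choices[0].text.strip())
```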

It also requires placing the inputs and outputs in the same form and container as will be presented for the final question.

This is something you can do on the API: you send all of those turns at once, each as its own role message, with the images correctly placed.
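A minimal sketch of that message layout with the Chat Completions API (the image URLs, system prompt, and demonstrated answers below are placeholders I made up; only the message structure is the point):

```python
from openai import OpenAI

client = OpenAI()

def image_part(url: str) -> dict:
    # Wrap an image URL as a vision content block.
    return {"type": "image_url", "image_url": {"url": url}}

# Two demonstration shots, each as its own user/assistant pair, followed by
# the real question in exactly the same form.
messages = [
    {"role": "system", "content": "You read speedometers and answer with the indicated speed only."},
    {"role": "user", "content": [
        {"type": "text", "text": "What speed is shown?"},
        image_part("https://example.com/speedometer_example_1.jpg"),
    ]},
    {"role": "assistant", "content": "22 mph"},
    {"role": "user", "content": [
        {"type": "text", "text": "What speed is shown?"},
        image_part("https://example.com/speedometer_example_2.jpg"),
    ]},
    {"role": "assistant", "content": "64 mph"},
    {"role": "user", "content": [
        {"type": "text", "text": "What speed is shown?"},
        image_part("https://example.com/speedometer_test.jpg"),
    ]},
]

response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # the vision model name at the time of this thread
    messages=messages,
    max_tokens=50,
)
print(response.choices[0].message.content)
```

The key point is that the prior assistant turns look to the model like answers it already gave, which is what makes it a demonstration rather than an instruction.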

I zero-shotted the API model with a prompt, and, writing as a native speaker without the paper’s unnatural language, my results didn’t get worse and worse with the addition of more and more text… not better, though. I avoided intricately describing the task and what was to be seen.

If OpenAI hadn’t blocked logprobs on the model, more analysis of certainties would be possible.
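For reference, on the older completion models where logprobs are still exposed, the request looks roughly like this (prompt and model are just an illustration); the vision chat model does not accept the parameter:

```python
from openai import OpenAI

client = OpenAI()

# Token log probabilities would let you gauge how certain the model is
# about each piece of its answer.
response = client.completions.create(
    model="gpt-3.5-turbo-instruct",
    prompt="The capital of Australia is",
    max_tokens=3,
    logprobs=5,  # return the top 5 alternatives per generated token
)
print(response.choices[0].logprobs.top_logprobs)
```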

Input to API

You are ChatPal, an AI assistant powered by GPT-4 with computer vision.
AI knowledge cutoff: April 2023

Built-in vision capabilities:

  • extract text from image
  • describe images
  • analyze image contents
  • logical problem solving requiring reasoning and contextual consideration

In the attached image of a speedometer (which was taken from an instructional video), your goal is to carefully analyze the current speed indicator and the surrounding speed legend, and determine the vehicle speed that is being indicated by the gauge face, interpolating the correct value if the needle end appears between two numbers or the subdivisions within.

Output from GPT-4-vision

– note that the correct speed is 9 mph, or 15 km/h –

The speedometer in the image shows two scales: the outer scale is in miles per hour (mph) and the inner scale is in kilometers per hour (km/h). The needle is pointing just past the 20 mph mark, roughly halfway between the 20 and 40 mph marks. Each small subdivision represents 2 mph (since there are 10 small marks between each 20 mph increment). Since the needle is halfway between 20 and 40 mph, we can estimate the speed to be approximately 30 mph. The corresponding speed in km/h would be just past the 40 km/h mark, close to 50 km/h, as the same logic applies to the inner scale.
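For anyone wanting to reproduce this, here is roughly how such a zero-shot request can be sent with the API. The file path and the "high" detail setting are my own placeholder choices; the system and task text are the ones quoted above:

```python
import base64
from openai import OpenAI

client = OpenAI()

SYSTEM = """You are ChatPal, an AI assistant powered by GPT-4 with computer vision.
AI knowledge cutoff: April 2023

Built-in vision capabilities:
- extract text from image
- describe images
- analyze image contents
- logical problem solving requiring reasoning and contextual consideration"""

TASK = (
    "In the attached image of a speedometer (which was taken from an instructional "
    "video), your goal is to carefully analyze the current speed indicator and the "
    "surrounding speed legend, and determine the vehicle speed that is being indicated "
    "by the gauge face, interpolating the correct value if the needle end appears "
    "between two numbers or the subdivisions within."
)

# Local image file encoded as a data URL; the filename is a placeholder.
with open("speedometer_frame.jpg", "rb") as f:
    data_url = "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    messages=[
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": [
            {"type": "text", "text": TASK},
            {"type": "image_url", "image_url": {"url": data_url, "detail": "high"}},
        ]},
    ],
    max_tokens=300,
)
print(response.choices[0].message.content)
```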


Interested in resurrecting this to see if few-shotting a vision model is any different from few-shotting a language model. Has anyone tried this, via the API I assume?