GPT-4 Vision few-shot prompting with images

Can someone provide complete code, including the libraries needed, that demonstrates few-shot prompting with images? Here is my use case: I want to prompt the model with a pair of images, a reference image and a shelf image, along with a description of the pair. Then, when the user uploads a new reference/shelf pair, the model should generate a description in the same style.

A single user message that includes a few images will look like:

user_message = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "describe these two images"},
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/jpeg;base64,{base64_image1}", "detail": "low"},
            },
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{base64_image2}", "detail": "high"},
            },
        ],
    }
]
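The base64_image1 and base64_image2 values in that message have to be produced from image files first. Here is a minimal sketch of one way to do that with only the Python standard library. The helper names image_to_data_url and image_part are made up for illustration, and the MIME type is guessed from the file extension, which is an assumption about how your files are named.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(path):
    """Read an image file and return it as a base64 data URL."""
    # Guess the MIME type from the file extension (assumption: the extension is correct).
    mime, _ = mimetypes.guess_type(str(path))
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognised image type: {path}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def image_part(path, detail="low"):
    """Build one image content part in the shape used in the user message."""
    return {
        "type": "image_url",
        "image_url": {"url": image_to_data_url(path), "detail": detail},
    }
```

With these, the content list of a user message could be written as a text part followed by image_part("reference.jpg") and image_part("shelf.png", detail="high"), where both file names are placeholders.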

Past turns of the conversation that contain user image input and the AI assistant's answer can be resent exactly as they were first sent if you want the AI to keep seeing the contents of the images from earlier messages.

Chat history can preserve everything, just as you would in a normal chat (at higher expense with images), or you can expire images, removing them from the messages you resend after a limited number of turns.

The style of response the AI writes is unlikely to vary much if you have it perform similar tasks again. You only need to resend older images when their content is still important; otherwise, old images are a distraction.
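Expiring old images from the history can be done before each request. The sketch below is one possible approach, not the only one: it keeps images in the most recent user turns and swaps older ones for a short text placeholder. The function name strip_old_images, its parameters, and the placeholder text are all hypothetical.

```python
import copy


def strip_old_images(messages, keep_last_user_turns=1, placeholder="[image removed]"):
    """Return a copy of messages with images removed from older user turns.

    Images in the last `keep_last_user_turns` user messages are kept;
    images in earlier user messages are replaced by a text placeholder.
    """
    pruned = copy.deepcopy(messages)  # leave the caller's history untouched
    user_indexes = [i for i, m in enumerate(pruned) if m["role"] == "user"]
    if keep_last_user_turns > 0:
        keep = set(user_indexes[-keep_last_user_turns:])
    else:
        keep = set()
    for i in user_indexes:
        content = pruned[i]["content"]
        # Plain-string content has no images, so there is nothing to strip.
        if i in keep or not isinstance(content, list):
            continue
        pruned[i]["content"] = [
            {"type": "text", "text": placeholder}
            if part.get("type") == "image_url"
            else part
            for part in content
        ]
    return pruned
```

Calling strip_old_images(history, keep_last_user_turns=2) before sending would keep images only in the two latest user messages, which trades some context for lower cost per request.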

But where are you labelling the images? Few-shot means providing images together with the output we want for each.

Multi-shot means showing the AI a pattern in the context you send, from which it can learn the desired output.

The unseen “user” and “assistant” prompts are themselves one such pattern when you send messages through chat completions, and that kind of break in the text to signify a different speaker is necessary when using a base AI.

An example set of training examples one might send to the AI to show it the responses desired for messages that include training images:

user: safe? (picture of lion)
assistant: {"safety": "unsafe"}
user: safe? (picture of baby)
assistant: {"safety": "safe"}
user: safe? (picture of blowtorch)
assistant: {"safety": "unsafe"}
user: safe? (picture of school glue)
assistant: {"safety": "safe"}

Given the previous conversation, in this case built from stock images, a new user input in the same format can evoke the same style of response and decision-making.
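In code, that pattern is just a messages list where each labelled example is a user turn with images followed by an assistant turn with the desired output, and the new input comes last. Here is a minimal sketch; the function names user_turn, build_few_shot_messages and ask are hypothetical, the example.com URLs are placeholders, and ask assumes the openai Python package (version 1.0 or later) with an API key in the environment and a vision-capable model name that you supply.

```python
def user_turn(question, image_urls, detail="low"):
    """Build one user message: a text part followed by one part per image."""
    content = [{"type": "text", "text": question}]
    for url in image_urls:
        content.append({"type": "image_url", "image_url": {"url": url, "detail": detail}})
    return {"role": "user", "content": content}


def build_few_shot_messages(examples, new_image_urls, question="safe?"):
    """Turn (image_urls, desired_answer) pairs plus a new input into chat messages."""
    messages = []
    for image_urls, answer in examples:
        messages.append(user_turn(question, image_urls))
        # The assistant turn is the "label": the output we want imitated.
        messages.append({"role": "assistant", "content": answer})
    messages.append(user_turn(question, new_image_urls))
    return messages


def ask(messages, model):
    """Send the messages to the API (assumes openai>=1.0 and OPENAI_API_KEY set)."""
    from openai import OpenAI  # imported here so building messages needs no install

    client = OpenAI()
    response = client.chat.completions.create(
        model=model, messages=messages, max_tokens=100
    )
    return response.choices[0].message.content


# Placeholder URLs; base64 data URLs work in the same position.
examples = [
    (["https://example.com/lion.jpg"], '{"safety": "unsafe"}'),
    (["https://example.com/baby.jpg"], '{"safety": "safe"}'),
]
messages = build_few_shot_messages(examples, ["https://example.com/new.jpg"])
```

For the reference/shelf use case, each example would carry two image URLs (reference first, then shelf) and the description you want as the answer string, with the question replaced by your own instruction.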

Chat models are now overtrained and can barely learn from context.