How to Include Image-Text Pairs as Few-Shot Examples in Prompts?

Hi everyone,

I’m working on a task where the model needs to generate text based on both an image and some input text (an image-text-to-text task).

I want to use few-shot prompting to guide the model, providing examples of the desired input/output behavior. However, I’m running into a conceptual block regarding how to format the examples themselves within the prompt.

Standard few-shot examples are typically text-to-text, which is straightforward to include:

Prompt:
Example 1 Input: [Text Input 1]
Example 1 Output: [Text Output 1]
Example 2 Input: [Text Input 2]
Example 2 Output: [Text Output 2]
Actual Input: [Actual Text Input]
Actual Output:

My challenge is: How do I represent the image part of an image-text pair within these examples in the prompt?

Prompt:
Example 1 Input Image: [??? How to include Image 1 ???]
Example 1 Input Text: [Text Input 1]
Example 1 Output Text: [Text Output 1]

Example 2 Input Image: [??? How to include Image 2 ???]
Example 2 Input Text: [Text Input 2]
Example 2 Output Text: [Text Output 2]

Actual Input Image: [Actual Image ???]
Actual Input Text: [Actual Text Input]
Actual Output Text:

Any insights, examples, or pointers to documentation would be greatly appreciated!

Thanks!

Thanks for the reply.

Isn’t there typically an image encoder that creates an image embedding before it is parsed along side the text? If so, passing the base64 image as text to the VLM won’t yield the same results?

The images must be contained in a role message, and only the “user” message is allowed vision input.

Your are not passing the image as text if you use the “type” of the user message part correctly.

It is just a method of providing the data to the API, a data URL, just as an API request with an internet URL would retrieve the image file and place the data, encoded and vectorized, positionally, in to AI context of that message.

messages

  • system:
    • type: text
    • text: you’re a nice bot
  • user:
    • type: text
    • text: look at this picture
    • type: image
    • image: {picture data}
    • type: text
    • text: tell me if im 2 cute

where the user has sent three list items.