Hi everyone,
I’m working on a task where the model needs to generate text based on both an image and some input text (an image-text-to-text task).
I want to use few-shot prompting to guide the model, providing examples of the desired input/output behavior. However, I’m running into a conceptual block regarding how to format the examples themselves within the prompt.
Standard few-shot examples are typically text-to-text, which is straightforward to include:
Prompt:
Example 1 Input: [Text Input 1]
Example 1 Output: [Text Output 1]
Example 2 Input: [Text Input 2]
Example 2 Output: [Text Output 2]
Actual Input: [Actual Text Input]
Actual Output:
My challenge is: How do I represent the image part of an image-text pair within these examples in the prompt?
Prompt:
Example 1 Input Image: [??? How to include Image 1 ???]
Example 1 Input Text: [Text Input 1]
Example 1 Output Text: [Text Output 1]
Example 2 Input Image: [??? How to include Image 2 ???]
Example 2 Input Text: [Text Input 2]
Example 2 Output Text: [Text Output 2]
Actual Input Image: [Actual Image ???]
Actual Input Text: [Actual Text Input]
Actual Output Text:
Any insights, examples, or pointers to documentation would be greatly appreciated!
Thanks!