Hi there. I’m trying to fine-tune an AI model’s behavior using few-shot examples with OpenAI’s chat.completions.create endpoint, but I’m struggling to structure it properly.
The goal is to have the model process an input image (that a user would provide) and respond with a specific textual output based on the image. I want to include a few examples to guide the AI’s reasoning and response style.
Does anyone have experience structuring this type of image-to-text few-shot setup? Specifically:
Where should I place the examples for the most effective guidance?
How do I reference image inputs effectively in prompts (e.g., placeholders or descriptions)?
Let’s call this “optimizing” rather than fine-tuning, since “fine-tuning” is a term reserved for further model training.
The current vision-capable models work better when instructed on a task, with examples framed as part of that instruction-following. However, the API does not allow you to place images in a system message.
Therefore, the only option is to provide example user inputs and assistant responses of the quality and perception you want reproduced. Be aware that the model will treat these as chat that actually happened, and heavily chat-trained models tend to partially disregard such turns: they have less propensity to pick up the demonstrated skill and more propensity to answer about it.
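A minimal sketch of that structure, assuming a vision-capable model such as gpt-4o and placeholder image URLs (both the model name and the URLs are assumptions; substitute your own):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

messages = [
    # Task instruction (no images allowed here)
    {"role": "system", "content": "Caption the user's image in one short line."},
    # Few-shot example: a user turn containing an image, then the ideal assistant reply
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},  # placeholder
        ],
    },
    {"role": "assistant", "content": "A tabby cat asleep on a sunny windowsill."},
    # The real input follows the same shape as the example turn
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": "https://example.com/real-input.jpg"}},
        ],
    },
]

response = client.chat.completions.create(model="gpt-4o", messages=messages)
print(response.choices[0].message.content)
```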
There is one modification you can make, though: use the name field on both roles. Give the example messages names like “exampleIn” and “exampleOut”, and they will be set apart from the real user input and responses. Your system message can then remain instructional instead of relying purely on in-context pattern training, e.g. “always follow the style of response demonstrated by assistant:exampleOut messages, placed by the developer”.
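Continuing the sketch above, the same example turns with name fields and a system instruction that points at them (the names and wording are illustrative, not required values):

```python
messages = [
    {
        "role": "system",
        "content": "Caption the user's image in one short line. Always follow the style "
                   "of response demonstrated by the assistant exampleOut messages, "
                   "placed by the developer.",
    },
    # Named example pair, distinguishable from the real conversation
    {
        "role": "user",
        "name": "exampleIn",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": "https://example.com/cat.jpg"}},
        ],
    },
    {"role": "assistant", "name": "exampleOut", "content": "A tabby cat asleep on a sunny windowsill."},
    # ...the real user turn goes here, with no name field...
]
```

Note that the name field accepts only simple identifiers (letters, digits, underscores; no spaces or colons), so the “assistant:exampleOut” wording is just how the instruction text refers to those messages, not the value of the name itself.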