Image labelling with image-based instructions

I would like to use the API to automatically label my image dataset. I tried the chat completions API with several prompts, but the task is pretty difficult, and it would be useful to show some known examples before the image to be annotated. Is it possible to create an assistant that takes a subset of pre-annotated images and labels new ones? If yes, how?

The technique you would want to try is multi-shot prompting, using chat completions, where you are the one in control of every message.

Besides the system message, you simulate prior user inputs and the desired assistant responses for several chat turns with images, placed before the final turn the AI will actually respond to. That leaves the AI predisposed to follow the patterns and techniques it has just observed.

So the endpoint that I should use is chat completion, right?

Yes. With chat completions you independently make calls that have the required construction, whereas the Assistants API is designed to maintain an ongoing user chat session.

Briefly, multi-shot would look like this shorthand, where you provide each example image in the API’s chat message data format:

system: You are a computer vision image analyst. Follow these rules: {blah blah}
user: label this image {image_example1}
assistant: Circuit board, minor scratches, missing component, good condition.
user: label this image {image_example2}
assistant: Broken gear, worn teeth, metal fragments, requires replacement.
user: label this image {image_example3}
assistant: Control panel, cracked screen, intact buttons, slight discoloration.
user: label this image {image_example4}
assistant: Hydraulic pump, oil leak, rusted bolts, operational status unknown.
user: label this image {image_example5}
assistant: Conveyor belt, torn section, frayed edges, motor functional.
user: label this image {image_under_evaluation}

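For reference, here is a minimal Python sketch of that structure using the openai library; the file names, example labels, and model name are placeholders for your own data:

```python
# Minimal multi-shot labelling sketch with the openai Python library (v1).
# Assumes OPENAI_API_KEY is set and the listed image files exist locally.
import base64
from openai import OpenAI

client = OpenAI()

def image_content(path: str) -> dict:
    """Encode a local image file as a base64 data-URL content part."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {"type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{b64}"}}

messages = [
    {"role": "system",
     "content": "You are a computer vision image analyst. Follow these rules: {blah blah}"},
]

# One user/assistant pair per pre-annotated example image
examples = [
    ("example1.jpg", "Circuit board, minor scratches, missing component, good condition."),
    ("example2.jpg", "Broken gear, worn teeth, metal fragments, requires replacement."),
]
for path, label in examples:
    messages.append({"role": "user",
                     "content": [{"type": "text", "text": "label this image"},
                                 image_content(path)]})
    messages.append({"role": "assistant", "content": label})

# Final turn: the image you actually want labelled
messages.append({"role": "user",
                 "content": [{"type": "text", "text": "label this image"},
                             image_content("new_image.jpg")]})

response = client.chat.completions.create(model="gpt-4o", messages=messages)
print(response.choices[0].message.content)
```
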
Examples can lessen the prompting work you must do, although this in-context “learning” is weaker on newer, heavily chat-tuned models that place most of their attention on the latest question.

Thanks, I will try that! The only drawback I see is that for each image under evaluation I must provide all this context, am I right? That means a lot of tokens for a single labeling.

You can set detail: "low" for some or all images. A low-detail image costs under 100 tokens per example (plus the tokens of the response you demonstrate), and it is encoded from a size within 512x512 then.
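In the content-part format from the sketch above, that setting goes inside the image_url object; a sketch of the same helper with a detail parameter added:

```python
# Sketch: "low" caps each image at the small fixed token cost; the API
# downscales the image before encoding it.
import base64

def image_content(path: str, detail: str = "low") -> dict:
    """Encode a local image as a data-URL content part with a detail level."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {"type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{b64}",
                          "detail": detail}}
```
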

You cannot use images in a system prompt to give your examples there, and you cannot fine-tune an OpenAI model with images, so if a picture speaks 1000 words of prompt, this is the method left for you.
