It is possible to have better performance by using few-shot prompting with image inputs and structured outputs?

I saw some samples with structured ouputs and few-shot prompting examples to enhance performances on text inputs. What about image inputs? Can I somehow tweak how/what the model see on a image by using few-shot prompting?

Also on the technical side, I searched the implementation of this use case, but I did not find a complete example yet. Digging into a sample implementation I found out that image inputs can be only user messages, while few-shot prompting makes use of system/developer messages. Is it possible technically?