i noticed that with anthropic and deepmind the respective claude and gemini vision model documentations encourage a certain image/text order in prompts.
Claude works best when images come before text. Images placed after text or interpolated with text will still perform well, but if your use case allows it, we recommend image-then-text structure.
so why do the openai api examples show text-then-image order? is that the order that works best, or does it not matter because a certain order is enforced no matter what?
It’s quite possible that 4V and claude v have different architectures. It’s also possible that it’s a training issue. Also note that this is just a recommendation. It’s possible that this is what gave them the best results during internal evaluation.
with 4V you can have a bunch of images in your conversation, but performance can degrade if you have too many, or too many of them are too similar - interestingly, those are the same issues it faces when confronted with redundant text.
Overall, I think the answer to your questions might be, it’s hard to say.
I’d encourage you to evaluate different methods for your use case and figure out what’s best.
My go-to is to introduce the subject context, followed by the actual subject (whether text or image), and then the query/instruction on the subject. that seems to work great for almost all cases.
I’m thinking that working with LLMs is kinda like horse riding.
Is it a science? not really. Can you make a science out of it? sure.
Is it an art? not really. Can you make an art out of it? sure.
Are you a cowboy, or are you doing dressage? Or are you into horse breeding? Which one’s right? Which one’s wrong?
I’d offer that a recommendation is just that - something to get started with. It’s quite possible that a third option might be discovered down the line that works even better, who knows.
to clarify i dont mean ordering images. say if you had one image and a text prompt. could you do text-then-image or image-then-text? if not, then what is the order that is enforced?