How to best work with 100s or 1000s of images


My question has been touched upon in different contexts in the forum before. Since links are disallowed, I’ll provide titles of related posts at the bottom.

My main question is:
What is the best way to work with 1000s of images with the GPT4-V API?

As I understand it, there is no built-in way to have the model keep track of the indices of the images sent.
In the post titled "I give 5 images to gpt4-vision and need to identify 2 similar images?", the suggested approach is to ask the model, in the text prompt, to output indices as JSON, like this:

You are a helpful assistant designed to output JSON.  
You will help extract the indices of items in an array based on the ordinal numbers mentioned in a text.

In my experience, this works for a handful of items, but the moment you scale up, ANY LLM starts hallucinating indices. This feels inherently unreliable to me.
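To make the failure mode concrete, here is a minimal sketch of a sanity check on the indices that come back. It assumes the model was asked to reply with JSON shaped like `{"indices": [...]}` using 0-based positions; that shape and the function name `check_indices` are illustrative assumptions, not something from the linked post.

```python
import json

def check_indices(reply_text, num_images):
    """Split the indices in a JSON reply into in-range and suspect ones.

    Assumes (hypothetically) a reply shaped like {"indices": [...]} that
    holds 0-based positions of the images sent in the request.
    """
    data = json.loads(reply_text)
    valid, invalid = [], []
    for idx in data.get("indices", []):
        # bool is a subclass of int in Python, so exclude it explicitly.
        is_int = isinstance(idx, int) and not isinstance(idx, bool)
        if is_int and 0 <= idx < num_images:
            valid.append(idx)
        else:
            invalid.append(idx)
    return valid, invalid

# Five images were sent, so index 7 cannot refer to any of them.
valid, invalid = check_indices('{"indices": [0, 3, 7]}', num_images=5)
print(valid, invalid)  # [0, 3] [7]
```

This only catches indices that are out of range or not integers. An index that is in range but points at the wrong image passes the check, which is exactly the case that worries me at scale.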

In another post titled “Referring to multiple images in vision API”, the suggested method is to add the name of the image inside the image itself, so that the model can read the label directly from the pixels.

As hacky as that is, it seems to be the most robust solution?
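For reference, a minimal sketch of how such a label could be stamped onto an image before upload. It assumes the Pillow library is available; the helper name `add_label`, the band height, and the example ID `img_0042` are all made up for illustration. Whether the model reads the label reliably at a given resolution is something to verify, not something this sketch guarantees.

```python
from PIL import Image, ImageDraw

def add_label(image, label, band_height=24):
    """Return a copy of `image` with a white band on top showing `label`."""
    width, height = image.size
    # New canvas that is taller by the height of the label band.
    labeled = Image.new("RGB", (width, height + band_height), "white")
    # Paste the original below the band so no image content is covered.
    labeled.paste(image.convert("RGB"), (0, band_height))
    draw = ImageDraw.Draw(labeled)
    # With no font argument, Pillow falls back to its small default font.
    draw.text((4, 4), label, fill="black")
    return labeled

# Stand-in for a real photo; "img_0042" is a hypothetical image ID.
photo = Image.new("RGB", (200, 100), "gray")
labeled = add_label(photo, "img_0042")
print(labeled.size)  # (200, 124)
```

Putting the label in a separate band rather than on top of the picture avoids hiding content, at the cost of slightly larger images.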

I’m curious whether anyone has sent GPT4-V 100s or 1000s of images in a single request and has had success keeping track of which image is which.