Hi – I’m trying to process as many images per day as I can with the gpt-4v preview model, so I’m batching as many images as possible into a single request.
Issue: the model only recognizes the first 4 images I submit (images 5 and beyond are ignored). For context, the request is ~5k tokens, and each image adds ~1k tokens.
Has anyone had this experience? I can’t find any documentation online.
There are limits to what the AI model can attend to.
It’s a chat model, pretrained to start shortening and truncating its output around 700 tokens of text.
Its vision component is not clearly documented, but larger images are reported to use more “tiles”, which is how you are billed.
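As a rough sketch of how the tile billing appears to work (based on OpenAI’s published pricing description; treat the exact resize rules here as an assumption): detail=low is a flat 85 tokens, while detail=high scales the image to fit 2048×2048, brings the shortest side down to 768, then bills 170 tokens per 512-pixel tile plus the 85-token base.

```python
import math

def vision_token_estimate(width: int, height: int, detail: str = "high") -> int:
    """Estimate an image's token cost under the published tiling scheme.

    Assumptions: low detail is a flat 85 tokens; high detail scales the
    image to fit within 2048x2048, then downscales so the shortest side
    is at most 768, and charges 170 tokens per 512px tile plus 85 base.
    """
    if detail == "low":
        return 85
    # Fit within a 2048 x 2048 square (downscale only)
    scale = min(1.0, 2048 / max(width, height))
    w, h = width * scale, height * scale
    # Bring the shortest side down to 768 (downscale only)
    scale = 768 / min(w, h)
    if scale < 1.0:
        w, h = w * scale, h * scale
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles
```

So a 1024×1024 image at high detail comes out to 4 tiles (765 tokens), which matches the roughly 1k tokens per image you’re seeing, while low detail would drop each to 85.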
For some tasks, the amount of attention it can pay, or the information it can take in, seems limited: text recognition will falter a third of the way through, though you can also prompt it to pick up again from the two-thirds point. Add more images and it gets confused.
An artificially low max_tokens is applied if you don’t specify one; set it to something more like 1500.
I would first try the detail=low setting, and resize images yourself so the longest dimension is 512 pixels. See if the resized images still give you the recognition you’re looking for, then pass a dozen with a request for very short answers (like “in a numbered list, tell me how many bananas appear in each image I’ve attached”).
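A minimal sketch of that approach (the helper names are hypothetical; the message shape follows the documented chat-completions vision format, but verify the details against the current API reference):

```python
import base64

def resize_to_512(width: int, height: int) -> tuple:
    """Target dimensions so the longest side is 512px (downscale only)."""
    longest = max(width, height)
    if longest <= 512:
        return (width, height)
    scale = 512 / longest
    return (round(width * scale), round(height * scale))

def build_batch_request(image_bytes_list: list, model: str = "gpt-4-vision-preview") -> dict:
    """Build one chat-completions payload carrying several low-detail images."""
    content = [{
        "type": "text",
        "text": ("In a numbered list, tell me how many bananas "
                 "appear in each image I've attached."),
    }]
    for raw in image_bytes_list:
        b64 = base64.b64encode(raw).decode("ascii")
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{b64}", "detail": "low"},
        })
    return {
        "model": model,
        "messages": [{"role": "user", "content": content}],
        "max_tokens": 1500,  # raise the artificially low default
    }
```

You’d POST that dict to the chat completions endpoint with your usual client; the base64 data-URL form spares you from hosting the images anywhere.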
There is only one model offered with computer vision, and it is not the original GPT-4.
A better multimodal model for us would depend on OpenAI restoring quality of attention masking and attention layers (at computational expense), which nobody outside the company has seen.
Batching into a single call doesn’t save much money, but it does promise a quality reduction. I’d instead go to your rate-limits page and see what it would take to raise your tier, or press the “request increase” button there. gpt-4-vision-preview is still noted as “not for production”.