GPT Vision API inconsistently processes multiple images - prompt engineering help needed

I’m working on an object and text recognition project that requires processing batches of multiple images via the Vision API. I’m using presigned URLs from AWS S3, which are all valid and accessible, but I’m consistently running into a puzzling issue:

The API seems to behave inconsistently when processing multiple images (usually more than 10):

  • Some prompts consistently process only 10 images (ignoring the rest)
  • Other prompts vary between processing 10, 20, or even incorrectly reporting 100 images
  • The exact same batch of images yields different results depending on prompt wording

This suggests there’s something about the prompt structure itself that affects how many images the model can “see” in a single API call.

Here is what I’ve tried

  1. Explicit counting instructions (“Process all X images…”)
  2. Different output formats (JSON vs. plain text)
  3. Minimal vs. detailed prompts
  4. Explicitly numbering the expected images

Questions for the community

  1. Is there a documented or undocumented limit to how many images can be processed in a single API call?
  2. Have you discovered specific prompt patterns that reliably process larger batches (15-20) of images?
  3. Are there particular phrases or structures that cause the model to “see” more images?
  4. Does anyone have a “perfect prompt” template that consistently processes all submitted images?

I’d greatly appreciate any insights, especially from those who have successfully worked with larger batches of images. This is a critical blocker for our project, and finding a reliable approach would be tremendously helpful.

Thanks in advance!

Welcome to the community!

Is there anything preventing you from sending these images one at a time?

Because one is the ideal number of things to process with a single prompt.
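
Just as a sketch, one image per call with the openai Python SDK would look something like this (the model name and URL list are placeholders, not anything from your setup):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

urls = ["https://example-bucket.s3.amazonaws.com/page-1.png"]  # your presigned URLs

for url in urls:
    response = client.chat.completions.create(
        model="gpt-4o",  # any vision-capable model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": "Extract all text and objects from this page."},
                {"type": "image_url", "image_url": {"url": url}},
            ],
        }],
    )
    print(response.choices[0].message.content)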


Thanks!
Sending multiple images together allows for context. Think of a multi-page scanned handwritten letter: the way it's written, the underlines, etc. on the first page can give us information that carries over to the next pages.


I see.

In my opinion, exhausting the context length is typically a bad idea. I'd advise you to split the work up as much as possible.

But if I were to tackle an all-in-one problem, I would try the enumeration/anchor method:

You are a good bot, bla bla bla

# 1: sdZ234TX2.png
[image 1 here]

# 2: dsou85sSf.png
[image 2 here]

....

you must abide by the following schema:

{
    "current_index": number, // e.g. 1
    "current_id": string, // file id with extension, e.g.: "sdZ234TX2.png"
    "content": string, // data you extract from this file
    "next_index": number, // the index of the next item to be processed, e.g. if current_index is 1, next_index is 2. If you've arrived at the end, write NaN.
    "next_id": string, // next file id to be processed. If arrived at the end, write "<END>"
}[]

begin your response with [.

You get the gist.

But you've already tried enumeration; the only difference here is that you force the model to re-enumerate so that the index is always at the forefront of the context, plus the random nonce anchors.
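
In API terms, that's just alternating text anchors and image parts inside a single user message. A rough sketch (the nonce filenames, URLs, and model name are made up):

from openai import OpenAI

client = OpenAI()

urls = ["https://example.com/a.png", "https://example.com/b.png"]  # your presigned URLs, in order

content = [{"type": "text", "text": "You are a good bot, bla bla bla"}]
for i, url in enumerate(urls, start=1):
    nonce = f"anchor_{i:02d}.png"  # random-looking id; doesn't have to be a real filename
    content.append({"type": "text", "text": f"# {i}: {nonce}"})
    content.append({"type": "image_url", "image_url": {"url": url}})
content.append({"type": "text", "text": "you must abide by the following schema: ... begin your response with [."})

response = client.chat.completions.create(
    model="gpt-4o",  # any vision-capable model
    messages=[{"role": "user", "content": content}],
)
print(response.choices[0].message.content)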

One thing to keep in mind is that most of these models have a very limited output length and will try their very best to weasel out of having to generate a long response. I don’t know how much of an issue this still is, but I’ve come to keep my outputs as short as possible.

To hack around this a tiny bit, you can try to make the end condition improbable, either with the prompt or with logit bias. The prompt version simply drops the end condition:

    "next_index": number, // the index of the next item to be processed, e.g. if current_index is 1, next_index is 2.
    "next_id": string, // next file id to be processed.

With logit bias it's a little trickier: you have to trick the model into beginning a sequence that you then prohibit:

    "next_index": string, // the index of the next item to be processed, e.g. if current_index is "1", next_index is "2". If you've arrived at the end, write "§".
    "next_id": string, // next file id to be processed. If arrived at the end, write "END"

Then suppress the probability of § with a logit bias of {18596: -100} (or use the tokenizer to pick other rare symbols: https://platform.openai.com/tokenizer). Here's a guide on that: https://help.openai.com/en/articles/5247780-using-logit-bias-to-alter-token-probability-with-the-openai-api
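
In the API call that's just the logit_bias parameter; a sketch reusing the request from above (double-check the token id against your model's tokenizer before relying on it):

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": content}],  # the interleaved prompt from above
    logit_bias={18596: -100},  # make the "§" end marker effectively impossible to emit
)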

I hope one of these tricks can help you out.

If you can describe the failure mode of your enumeration attempt, that might also shed some light on what we could try.


I haven't had the chance to implement the function that extracts file names from each received presigned URL, but I understand your approach. Instead, I tried other enumeration-style prompts, such as asking the model to return something from each file (for example, all detected names along with the corresponding file number or index). However, I noticed that the model tends to stop at 10-15 files, even when I provide 20-30 images.

My goal is to identify the X most significant names across all images, where "significant" means either well-known individuals (e.g., Albert Einstein = A. Einstein = A.E.) or names that appear frequently in the text. Do you think breaking this complex task into smaller subtasks (perform OCR on all images → extract all names from the text → determine which names are most important) would lead to significantly better and more consistent results? Or is there a way to improve the prompt itself to handle the entire process more effectively?

Those filenames are just nonsense and don't have to be real. Real filenames might actually be counterproductive.

What was the stop reason you got?

It sounds like you might be hitting the length restriction, but to confirm that we'd have to look at the exact inputs and outputs, I think.
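
For what it's worth, the stop reason is right on the response object (chat completions), so it's worth printing:

choice = response.choices[0]  # response from client.chat.completions.create(...)
print(choice.finish_reason)   # "stop" = model chose to end; "length" = hit the output cap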


but to answer the actual question:

My first question would be, how would you - as a human - accomplish this task?

Then I’d try to set up a system where the models can work as you would.

OCR is a good idea you can try, but I think you can do the same thing on a page-by-page basis.

Sounds like a dumb question, but how do you flip a page? If you’re dealing with a half-finished sentence, how do you deal with that? A half-finished paragraph? A half-finished concept?

Rolling Summaries

One thing that was commonly done in the early days was to aggregate long chats into rolling summaries.

Basically for you, you’d feed in one (or two) pages at a time, but keep a rolling summary of previous pages as you work your way through the documents.
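
Very loosely, that loop could look like this (prompt wording and model name improvised):

from openai import OpenAI

client = OpenAI()
urls = ["https://example.com/page-1.png", "https://example.com/page-2.png"]  # hypothetical

summary = ""
for i, url in enumerate(urls, start=1):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": f"Rolling summary of pages so far:\n{summary}\n\n"
                                         f"This is page {i}. Extract the names on it, then "
                                         "return an updated rolling summary."},
                {"type": "image_url", "image_url": {"url": url}},
            ],
        }],
    )
    summary = response.choices[0].message.content  # carried into the next iteration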

Evolving Thoughts / Rolling Encyclopedia

Similar to rolling summaries, you can implement a sort of "scratch pad" for the model. Basically you'd ask the model to take specific notes (e.g., names of people or concepts mentioned in your pages, and how they evolve) that get re-aggregated before re-instantiating with the next (set of) page(s).

Just Continue the Conversation

If you're dealing with an issue where you're reaching max output, the easiest solution might be to simply ask the model to continue. You send the output back in as part of the context, and send an automated message to continue.
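
Sketched out (client as in the earlier snippets; the loop just checks the finish reason):

messages = [{"role": "user", "content": "your original prompt here"}]  # hypothetical
parts = []
while True:
    response = client.chat.completions.create(model="gpt-4o", messages=messages)
    choice = response.choices[0]
    parts.append(choice.message.content)
    if choice.finish_reason != "length":
        break  # the model finished on its own
    # it was cut off: feed the partial answer back and ask it to keep going
    messages.append({"role": "assistant", "content": choice.message.content})
    messages.append({"role": "user", "content": "continue"})
full_output = "".join(parts)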

Again, I'd have to take a look at the inputs and outputs to make a qualified judgement here, but it could be an easy solution to try.

But to your points:


Breaking stuff up into atomic parts is always a very good idea.

I'd question the last step here. How do you determine which names are most important? It's critical to keep in mind that models can't really count, so you'd need to work with or around that limitation. But determining famous people? That should be easy. I'd suggest having the model go over every single name and return whether that person is important (and, before asking whether they're important, ask why they would or wouldn't be; the CoT pattern).
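
As a sketch, that per-name check could be as simple as this (client as above; prompt wording improvised):

def is_significant(name: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": f'Consider the person "{name}". First explain briefly why they would '
                       "or wouldn't be historically or academically important, then answer "
                       'on a final line with exactly IMPORTANT or NOT IMPORTANT.',
        }],
    )
    return response.choices[0].message.content

for name in ["Albert Einstein", "John Doe"]:  # hypothetical extracted names
    print(name, "->", is_significant(name).splitlines()[-1])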

Hope this helps. If you can share your actual prompt+data, it would make it easier :)

Really? I did try that in multiple variations, and for all of them it kept returning that it can't 'process the images directly' or something like that... weird...

No stop reason at all... it stops there like that's the complete answer. I don't know if that indicates anything, but if I add 20-30 URLs and ask something as simple as 'how many images did you get?', it says something like 10-15.

Good questions. I assume the given images are part of a multi-page document; I put them in the payload in the correct order, and in my prompt I mention that the attachments are of something 'continuous', meaning each image has a connection to the others. We wanted to send them all at once in the first place (and not just the text) because the visual way something is written sometimes emphasizes importance, for example a name/place written in bold letters or underlined. If we send only the OCR results, we will of course lose these visual features.

I sent ChatGPT the OCR results of 31 images and asked it this:

Here is the combined text from multiple pages of a document:
<<<LONG TEXT IN SEPARATE TXT FILE>>>

Your task is to extract all person names mentioned in the text. For each name:
1. Provide the name exactly as it appears in the text (no honorifics, e.g., "Mr. Albert Einstein" → "Albert Einstein").
2. Deduplicate exact occurrences and group variations of the same name (e.g., "Einstein, Albert" and "Albert Einstein").
3. Record all locations where the name appears in the text, referencing the image index (e.g., "Page 1, Paragraph 2").

Format the output as:
{
  "names": [
    {
      "name": "preferred_name",
      "variations": ["name_variation_1", "name_variation_2"],
      "occurrences": [
        {"image_index": 1, "text_snippet": "text around name"},
        {"image_index": 2, "text_snippet": "text around name"}
      ]
    },
    ...
  ]
}

Then I took this response and asked it this:

Here is a list of person names extracted from the combined text, along with their occurrences and variations:
<<<INSERT OUTPUT FROM STEP 1>>>

Your task is to identify the 5 most significant persons from this list. Use the following criteria to decide significance:
1. Historical/academic importance of the individual.
2. Appearance across multiple images or prominence in the text.
3. Relevance as identified in recurring mentions or emphasized statements.

For each of the top 5 persons, provide:
1. The preferred name.
2. Justification for their significance (referencing context and occurrences).
3. A count of how many times they appeared across all images and text snippets.
4. The locations where they appeared (by image index and position in text).

Return the result as:
{
  "top_5_significant_persons": [
    {
      "name": "preferred_name",
      "justification": "reason for significance",
      "total_mentions": 5,
      "locations": [
        {"image_index": 1, "text_snippet": "snippet of surrounding text"}
      ]
    },
    ...
  ]
}

And it was actually able to produce a pretty good result! If you're saying it doesn't know how to count, I don't know how it was able to do such a good job XD

You're right... it's late here, and all of that work, including the ChatGPT chat, is on my work laptop. I'll send you everything I can tomorrow.

Thank you so much for your ideas and detailed responses. This really helps, and I hope we'll figure it out!!