GPT-4 Vision Pixel Limitations

I am trying to build a flow that uses frames extracted from a video to create a description of that video, and I was wondering about a few things:

  • What is the maximum number of pixels the gpt-4-vision-preview model can handle in a single call? I have one keymap image, 320 px wide and 51K px tall, in which all the frames are stitched together, but OpenAI rejected it when I submitted it. I assume I am over the 4096-token threshold for this model. For the record, I am using the POST /chat/completions call.
  • Should I provide the frames as URLs or as base64-encoded strings? I was wondering whether one of the two would reduce prompt token consumption a bit, as the URLs of those pictures are pre-signed and hence lengthy.
  • If I need to provide them in batches, how can I preserve the context from the previous batch? Or do I need to pass the description from the previous call(s) and tell ChatGPT that this is a new batch of pictures, and that it needs to use the previous context to create a second part of the description?

Thanks a lot in advance to everyone who can find some time to answer some of my questions!

What are the pixel limitations when the GPT-4V model reads an image? I was trying to read an image of 9933x9934 px and I got an error.

Maybe you hit the file size limitation of 20 MB.
But it’s generally advised to reduce the dimensions of the images before passing them to the model.

You can read up on the process here:
https://platform.openai.com/docs/guides/vision

If you manage to send that image by resizing or re-encoding it, be aware that the model resizes it again: in high-detail mode it is first scaled to fit within 2048x2048, then scaled down so that the shortest side is no larger than 768px. For a nearly square image like yours, that means you are basically sending something that will be interpreted at 768x768, in four 512px detail tiles.
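To make the resizing arithmetic concrete, here is a rough sketch of how the resulting size and token cost could be estimated. It assumes the figures from the vision guide for high-detail mode (85 base tokens plus 170 per 512px tile) and assumes images are only ever scaled down, never up. The function names are mine, not part of any OpenAI library.

```python
import math

def high_detail_size(width, height):
    """Estimate the size an image ends up at in high-detail mode (assumed rules)."""
    # Step 1: shrink so the image fits inside a 2048 x 2048 square.
    longest = max(width, height)
    if longest > 2048:
        scale = 2048 / longest
        width = max(1, round(width * scale))
        height = max(1, round(height * scale))
    # Step 2: shrink so the shortest side is at most 768 px.
    shortest = min(width, height)
    if shortest > 768:
        scale = 768 / shortest
        width = max(1, round(width * scale))
        height = max(1, round(height * scale))
    return width, height

def high_detail_tokens(width, height):
    """Estimate prompt tokens: 85 base + 170 per 512 px tile (assumed figures)."""
    w, h = high_detail_size(width, height)
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 85 + 170 * tiles

print(high_detail_size(9933, 9934))    # (768, 768)
print(high_detail_tokens(9933, 9934))  # 765
print(high_detail_size(320, 51000))    # (13, 2048)
```

Under these assumptions, the 320 x 51K strip from the first post would end up only about 13 px wide, so the individual frames would be unreadable even if the upload were accepted.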

Here’s a snippet for constraining the size and cost by capping the maximum dimension at 1024 (the longest dimension of a long, skinny image like the one in the first post would otherwise be resized down to 2048).
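A minimal sketch of that idea, assuming the Pillow library is installed for the actual file handling. The names `constrain_size` and `resize_file` are illustrative only, and 1024 is just the cap mentioned above.

```python
def constrain_size(width, height, max_dim=1024):
    """Return (width, height) scaled down so the longest side is at most max_dim."""
    longest = max(width, height)
    if longest <= max_dim:
        return width, height  # already small enough, leave untouched
    scale = max_dim / longest
    return max(1, round(width * scale)), max(1, round(height * scale))

def resize_file(src_path, dst_path, max_dim=1024):
    """Resize an image file on disk; needs Pillow (pip install Pillow)."""
    from PIL import Image  # imported lazily so constrain_size works without Pillow
    with Image.open(src_path) as img:
        new_size = constrain_size(img.width, img.height, max_dim)
        if new_size != img.size:
            img = img.resize(new_size)
        img.save(dst_path)
    return new_size

print(constrain_size(4000, 2000))  # (1024, 512)
print(constrain_size(800, 600))    # (800, 600)
```

For a very tall strip this still leaves almost no width (320 x 51000 becomes roughly 6 x 1024), so cutting the strip back into separate frames before resizing is probably the better route.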

Thanks a lot for the information. I will check the images and test the example you provided.