GPT-4 Vision Pixel Limitations

I am trying to build a flow that uses frames extracted from a video to create a description of that video, and I was wondering a couple of things:

  • What is the maximum number of pixels the gpt-4-vision-preview model can handle in a single call? I have one stitched image, 320 px wide and 51,000 px tall, in which all the frames are stacked vertically, but when I provided it to OpenAI it was rejected; I assume I am over the 4096-token threshold for this model. For the record, I am calling POST /chat/completions (a sketch of my request is below this list).
  • Should I provide the frames as URLs or as base64-encoded strings? I was wondering whether base64 might reduce prompt-token consumption a bit, since the URLs of those pictures are pre-signed and therefore lengthy.
  • If I need to provide the frames in batches, how can I preserve the context from the previous batch? Or do I need to pass the description from the previous call(s) and tell ChatGPT that this is a new batch of pictures and that it should use the previous context to write the next part of the description? (See the second sketch after this list for what I have in mind.)
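
In case it helps, here is roughly what my current call looks like. This is a minimal sketch with the official openai Python SDK; the file name, prompt text, and max_tokens value are placeholders of mine, not part of my real pipeline:

```python
import base64

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# "stitched.png" stands in for my 320 x 51,000 px image with all
# the frames stacked vertically.
with open("stitched.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="gpt-4-vision-preview",
    max_tokens=512,
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "Describe the video these frames were taken from.",
                },
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                },
            ],
        }
    ],
)
print(response.choices[0].message.content)
```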

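And for the third question, this is the batching idea I was describing: pass the accumulated description into each new call as plain text. Just a sketch under my own assumptions; the function name, prompt wording, and batch structure are all mine:

```python
from openai import OpenAI

def describe_in_batches(client: OpenAI, frame_batches: list[list[str]],
                        model: str = "gpt-4-vision-preview") -> str:
    """Describe a video batch by batch, feeding each call the text
    produced so far so the model can extend the description.

    frame_batches: a list of batches, each a list of base64-encoded
    PNG frames (hypothetical structure for this sketch).
    """
    description = ""
    for i, batch in enumerate(frame_batches):
        # One text part carrying the running description, followed by
        # the new frames for this batch as data-URL image parts.
        content = [{
            "type": "text",
            "text": (
                f"These are frames from batch {i + 1} of {len(frame_batches)} "
                f"of a single video. Description so far: "
                f"{description or '(none yet)'} "
                "Continue the description using the new frames."
            ),
        }]
        content += [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}}
            for b64 in batch
        ]
        response = client.chat.completions.create(
            model=model,
            max_tokens=512,
            messages=[{"role": "user", "content": content}],
        )
        description = (description + " "
                       + response.choices[0].message.content).strip()
    return description
```

Is this the right pattern, or is there a better way to carry context across calls?
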
Thanks a lot in advance to everyone who can find some time to answer any of my questions!