I am trying to create a flow that uses frames extracted from a video to generate a description of that video, and I was wondering a couple of things:
What is the maximum number of pixels the gpt-4-vision-preview model can handle in a single call? I have one "keymap" image, 320 px wide and 51K px tall, in which all the frames are stitched together, but when I provided it to OpenAI it was rejected, so I assume I am over the 4096-token threshold for this model. For the record, I am using the POST /chat/completions call; a simplified sketch of my request is included further down.
Should I provide the frames as URLs or as base64-encoded strings? I was wondering whether providing them as base64 might minimize prompt-token consumption a bit, since the URLs of those pictures are pre-signed and therefore lengthy.
If I need to provide them in batches, how can I preserve the context from the previous batch? Or do I need to pass the description from the previous call(s) and tell ChatGPT that this is a new batch of pictures and that it should use the previous context to create the next part of the description?
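For reference, this is roughly what my current request looks like (a simplified sketch: the file names, the carried-over description, and the prompt wording are just placeholders):

```python
import base64
import os

import requests

api_key = os.environ["OPENAI_API_KEY"]

def encode_frame(path: str) -> str:
    """Read one extracted frame from disk and return it base64-encoded."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

# Placeholders: description produced by the previous batch, and the next frames.
previous_description = "Description built from the earlier frames..."
frame_paths = ["frame_001.jpg", "frame_002.jpg"]

content = [
    {
        "type": "text",
        "text": (
            "Here is the description of the video so far:\n"
            f"{previous_description}\n\n"
            "These images are the next batch of frames from the same video. "
            "Continue the description."
        ),
    },
]
for path in frame_paths:
    content.append(
        {
            "type": "image_url",
            "image_url": {
                "url": f"data:image/jpeg;base64,{encode_frame(path)}",
                "detail": "high",
            },
        }
    )

response = requests.post(
    "https://api.openai.com/v1/chat/completions",
    headers={"Authorization": f"Bearer {api_key}"},
    json={
        "model": "gpt-4-vision-preview",
        "messages": [{"role": "user", "content": content}],
        "max_tokens": 1000,
    },
)
print(response.json()["choices"][0]["message"]["content"])
```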
Thanks a lot in advance to everyone who can find some time to answer some or all of my questions!
If you are able to successfully send that by resizing or re-encoding, you should be aware that the image will be scaled down so that its shortest side is no larger than 768 px (after first being fit within 2048 px on its longest side). That means you are basically sending something that will be interpreted at roughly 768x768 and billed as four detail tiles.
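As a rough illustration of how the resize-and-tile rules translate into token cost (my own sketch, using the documented 85-token base plus 170 tokens per 512-px tile for high detail):

```python
import math

def vision_tokens(width: int, height: int, detail: str = "high") -> int:
    """Estimate image token cost for gpt-4-vision-preview.

    Low detail is a flat 85 tokens. High detail fits the image inside
    2048x2048, scales the shortest side down to 768 px if needed, then
    charges 85 tokens plus 170 tokens per 512-px tile.
    """
    if detail == "low":
        return 85

    # Fit within a 2048 x 2048 square, preserving aspect ratio.
    if max(width, height) > 2048:
        scale = 2048 / max(width, height)
        width, height = int(width * scale), int(height * scale)

    # Scale so the shortest side is no more than 768 px.
    if min(width, height) > 768:
        scale = 768 / min(width, height)
        width, height = int(width * scale), int(height * scale)

    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 85 + 170 * tiles

print(vision_tokens(768, 768))    # 2x2 tiles -> 765 tokens
print(vision_tokens(320, 51200))  # the long strip ends up ~12x2048 -> 4 tiles, 765 tokens
```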
Here’s a snippet for constraining the size and cost by capping the maximum dimension at 1024 (whereas a long, skinny image like the one in the first post would normally only be resized down to 2048 on its longest side).
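A minimal sketch with Pillow (the file path, JPEG format, and quality setting are just example choices):

```python
import base64
import io

from PIL import Image

def shrink_and_encode(path: str, max_dim: int = 1024, quality: int = 85) -> str:
    """Downscale an image so its longest side is at most max_dim,
    re-encode it as JPEG, and return a data URL ready for the API."""
    img = Image.open(path).convert("RGB")
    scale = max_dim / max(img.size)
    if scale < 1:
        new_size = (max(1, round(img.width * scale)), max(1, round(img.height * scale)))
        img = img.resize(new_size, Image.LANCZOS)

    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    b64 = base64.b64encode(buf.getvalue()).decode("utf-8")
    return f"data:image/jpeg;base64,{b64}"

# Example: shrink the stitched frame strip before attaching it to the request.
data_url = shrink_and_encode("stitched_frames.jpg")
```

Fewer and smaller tiles mean fewer image tokens, at the cost of some detail in each frame.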