I am trying to create a flow that uses frames extracted from a video to create a description of said video, and I was wondering a couple of things:
- What is the maximum number of pixels the gpt-4-vision-preview model can handle in a single call? I have one keymap image, 320 px wide and 51K px tall, in which all the frames are stitched together, but when I provided it to OpenAI it was rejected; I assume I am over the 4096-token threshold for this model. For the record, I am using the
- Should I provide the frames as URLs or as base64-encoded strings? I was wondering whether providing them as base64 might minimize prompt-token consumption a bit, since the URLs of those pictures are pre-signed and hence lengthy.
- If I need to provide them in batches, how can I preserve the context from the previous batch? Or do I need to pass the description from the previous call(s) and tell ChatGPT that this is a new batch of pictures and that it should use the previous context to create the next part of the description?
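To make the first question concrete: if the tall stitched image is indeed the problem, my fallback plan is to slice it back into horizontal strips before sending. A minimal sketch of the strip computation (the 1024 px strip height is an arbitrary choice on my part, and the boxes are in Pillow-style `(left, upper, right, lower)` form):

```python
def tile_boxes(width, height, tile_height):
    """Compute (left, upper, right, lower) crop boxes that slice a tall
    stitched image into horizontal strips of at most tile_height pixels."""
    boxes = []
    for top in range(0, height, tile_height):
        bottom = min(top + tile_height, height)
        boxes.append((0, top, width, bottom))
    return boxes

# For my 320 x 51000 px keymap image, sliced into 1024-px-tall strips:
boxes = tile_boxes(320, 51000, 1024)
```

Each box could then be fed to `Image.crop()` and the resulting strips sent as separate images in one or more calls.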
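And for the batching question, what I had in mind is threading the description produced so far into the next call's prompt; a sketch of that, assuming the image parts are already built (the prompt wording here is just my own guess at what would work):

```python
def batch_messages(image_parts, previous_description=None):
    """Assemble the messages for one batch of frames, passing the prior
    batch's description through as context so the model can continue it."""
    if previous_description is None:
        text = "Describe these video frames."
    else:
        text = ("Here is the description of the video so far:\n"
                f"{previous_description}\n"
                "These frames continue the same video; "
                "extend the description.")
    return [{"role": "user",
             "content": [{"type": "text", "text": text}, *image_parts]}]
```

Is this kind of manual context-passing the intended approach, or is there a better way to keep continuity across calls?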
Thanks a lot in advance to everyone who can find some time to answer some of my questions!