Regardless of your image's size, it will tile it (unless it's already 512x512).
If it's high-res, it will first downscale it to 512x512 for an overview, then resize the high-res image and slice/tile it for closer inspection.
You can actually see how it will format the image in the pricing calculator:
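For reference, here is a rough sketch of how that tiling math translates into token cost, based on my reading of the published GPT-4 Vision sizing rules (the exact constants and small-image behavior may differ, so treat this as an estimate, not the official calculator):

```python
import math

def vision_token_estimate(width: int, height: int, detail: str = "high") -> int:
    """Rough token estimate for one image under the documented GPT-4 Vision rules.
    Low detail is just the single 512x512 overview."""
    if detail == "low":
        return 85  # flat cost for the overview pass
    # Fit within a 2048x2048 square, keeping aspect ratio (no upscaling assumed).
    scale = min(1.0, 2048 / max(width, height))
    width, height = width * scale, height * scale
    # Scale so the shortest side is at most 768px.
    scale = min(1.0, 768 / min(width, height))
    width, height = width * scale, height * scale
    # Count the 512x512 tiles needed to cover the resized image.
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 85 + 170 * tiles  # base overview + per-tile cost

print(vision_token_estimate(1920, 1080))  # a 1080p frame -> 6 tiles -> 1105 tokens
```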
No wonder even GPT-4 can't properly describe what happened in the video.
Show a series of these pictures to a random person and they too will have difficulty describing what happened in the video. There is a key piece of information missing here: the air conditioner fell out of the window. That isn't clear from the pictures, whereas it is clear in the video.
The solution here might be to take frames from the film, not at equal intervals, but from one key event to the next.
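As a rough illustration of that idea, FFmpeg's scene-change filter can pull frames at cuts and key events instead of sampling at fixed intervals. A minimal sketch, assuming FFmpeg is installed, that 0.3 is a workable scene threshold for your footage, and a made-up input filename:

```python
import subprocess

def extract_scene_frames(video: str, out_pattern: str = "frame_%03d.png",
                         threshold: float = 0.3) -> None:
    """Grab a frame whenever FFmpeg detects a scene change above `threshold`,
    instead of taking frames at equal intervals."""
    subprocess.run([
        "ffmpeg", "-i", video,
        "-vf", f"select='gt(scene,{threshold})'",
        "-vsync", "vfr",          # only emit the selected frames
        out_pattern,
    ], check=True)

extract_scene_frames("air_conditioner.mp4")  # hypothetical input file
```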
@Fusseldieb - I found this while googling, because I had a similar idea to extract scene cuts/keyframes and also add timestamps to the images. The grid idea improves it a lot.
I work on an open source module for the Drupal CMS that has different AI components for building full workflows. I just added this as a Video-to-Text workflow.
For anyone wanting to replicate this, you can see the FFmpeg commands at line 298 of this file: git.drupalcode.org/project/ai_interpolator_openai/-/blob/1.0.x/src/Plugin/AiInterpolatorFieldRules/OpenAiVideoToText.php?ref_type=heads#L298
(Sorry, I’m not allowed to link yet).
I also created a video of how to set this up using the GUI, along with an example result, if anyone wants to replicate it:
workflows-of-ai . com/workflow/improve-media-3
It requires you to know how to install Drupal though.
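Along the lines of the timestamped-grid idea above (the module's actual FFmpeg invocation is in the linked file), here is a hedged Python/Pillow sketch that stitches already-extracted frames into a labelled grid so the model can see the ordering; the frame filenames and timestamps are made up for illustration:

```python
from PIL import Image, ImageDraw

def build_grid(frames: list[tuple[str, str]], cols: int = 4,
               cell: int = 512, out: str = "grid.png") -> None:
    """Compose (path, timestamp) frame pairs into a single grid image,
    stamping each cell with its timestamp."""
    rows = -(-len(frames) // cols)  # ceiling division
    canvas = Image.new("RGB", (cols * cell, rows * cell), "black")
    draw = ImageDraw.Draw(canvas)
    for i, (path, ts) in enumerate(frames):
        tile = Image.open(path).resize((cell, cell))  # simple square resize
        x, y = (i % cols) * cell, (i // cols) * cell
        canvas.paste(tile, (x, y))
        draw.text((x + 8, y + 8), ts, fill="yellow")
    canvas.save(out)

# Hypothetical frames extracted earlier, with their timestamps.
build_grid([("frame_001.png", "00:02"), ("frame_002.png", "00:05"),
            ("frame_003.png", "00:09")])
```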
This is a brilliant strategy. I'm working on a mobile app that could use this feature and wondered if you could help me implement the code behind your strategy in a BuildShip node or a custom function within OpenAI Assistants?
Hi @Meistro,
I would personally wait until GPT-4o is fully out, so we can take advantage of the new Audio model, which seems, at least from the presentations, worlds better than Whisper.
Are you opposed to helping me anyway? I'd like to expedite production and then upgrade the model once the code structure is in place. I'm just having trouble converting the OpenAI tutorial from Python to JS to work in BuildShip. Thanks!
Regarding the air conditioner video, I tried explaining it with a total of 23 frames, including keyframes and a few surrounding frames, and it seems to have worked well.
The sequence of images shows a man interacting with a window air conditioning unit. Here’s a breakdown of the key frames and why this might be funny:
First few frames: The man is seen standing next to a window air conditioning unit, seemingly adjusting or inspecting it.
Middle frames: The air conditioning unit suddenly falls out of the window, leaving a gaping hole. The man’s expression and body language suggest surprise and confusion.
Last few frames: The man looks at the empty window frame, then turns away, possibly in disbelief or frustration.
The humor in this situation likely comes from the unexpected and sudden nature of the air conditioning unit falling out of the window. The man’s reaction adds to the comedic effect, as it captures a moment of surprise and helplessness. This type of physical comedy, where something goes wrong in an unexpected way, is a common source of humor.
If anyone wants to know the specific system messages or how I did it, please let me know.
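For anyone curious, here is a minimal sketch of how an ordered frame sequence can be sent to the vision model with the openai Python SDK; the prompt wording and file names are my own guesses, not necessarily what the poster above used:

```python
import base64
from openai import OpenAI

client = OpenAI()

def describe_sequence(frame_paths: list[str]) -> str:
    """Send the frames in chronological order in one request and ask the
    model to reason about what happens between them."""
    content = [{"type": "text",
                "text": "These frames are in chronological order. "
                        "Describe the sequence of events and why it is funny."}]
    for i, path in enumerate(frame_paths, start=1):
        with open(path, "rb") as f:
            b64 = base64.b64encode(f.read()).decode()
        content.append({"type": "text", "text": f"Frame {i}:"})
        content.append({"type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{b64}",
                                      "detail": "low"}})
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": content}],
    )
    return resp.choices[0].message.content

print(describe_sequence([f"frame_{i:03d}.png" for i in range(1, 24)]))  # 23 frames
```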
I’ve made a similar example that integrates TTS to narrate the video and selects 5 frames from it; you can check out the prompts and the example I gave: huggingface[dot]/spaces/Pixeltable/AI-Video-Analyzer-GTP4-Vision-TTS-Narration
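A hedged sketch of that narration step using the OpenAI TTS endpoint (this is not the Pixeltable space's actual code); it assumes you already have a text description produced by the vision step:

```python
from openai import OpenAI

client = OpenAI()

def narrate(description: str, out_path: str = "narration.mp3") -> None:
    """Turn the vision model's description of the frames into spoken audio."""
    speech = client.audio.speech.create(
        model="tts-1",
        voice="alloy",
        input=description,
    )
    speech.write_to_file(out_path)

narrate("A man adjusts a window AC unit, which suddenly falls out of the window.")
```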
@dignity_for_all - I would love that. I can't get 4o to understand the sequence of events here. It thinks Curry is about to take a shot. If you can share how you're helping GPT understand the sequence and direction of events etc., that would be helpful!
@pbrunelle - Thank you! This code will be very useful down the line. For now, GPT-4o is struggling with spatial reasoning. It still thinks the Warriors are about to score in the frame I shared, when in fact they are moving the ball down the court.
Ideally, for a task like this, you wouldn't dedicate GPT-4V to the full task if it doesn't work out of the box. Even if it did, it would probably cost less to just hire someone to do this for you.
You can struggle with prompting, but the reality is that the controls available are hopelessly insufficient.
Instead, the first step to focus on would be object detection for the people and the basketball. Then, once you can reliably identify those, you can pass the results to GPT-4V for easier understanding.
But, as you've noted, GPT-4V and language models in general are not good at spatial reasoning.
In this case, I would focus more on mapping the trajectory of the basketball and determining which team has possession of it.
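To make that concrete, here is one way the detection step could look with an off-the-shelf detector (YOLOv8 via the ultralytics package is my own choice here, not something specified above); in the COCO label set, class 0 is "person" and class 32 is "sports ball":

```python
from ultralytics import YOLO

# Pretrained COCO model; class 0 = person, class 32 = sports ball.
model = YOLO("yolov8n.pt")

def detect_players_and_ball(frame_path: str):
    """Return labelled bounding boxes for people and the ball in one frame,
    which can then be summarized as text and handed to GPT-4V."""
    results = model(frame_path, classes=[0, 32])
    boxes = []
    for r in results:
        for box in r.boxes:
            label = model.names[int(box.cls)]
            boxes.append((label, [round(v) for v in box.xyxy[0].tolist()]))
    return boxes

# Tracking the ball's box across frames gives the trajectory / possession signal.
print(detect_players_and_ball("court_frame_001.png"))  # hypothetical frame
```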