I read this cookbook that explains that you can do video analysis by extracting its frames and then applying image analysis to each frame.
While this allows me to do video-to-text analysis, I wonder how you can understand whether and how things have evolved/changed over the course of the video.
For example, I want to analyse a video recording of an interview and understand how the person's emotional state changes. How would you approach this?
My thought:
- chunk the video into frames
- ask GPT to analyse the emotion in each frame
- place the analysis outputs from step 2 into sliding windows, say 5 frames at a time, then ask GPT how the emotion has changed based on the descriptions it generated in step 2 (rough sketch after this list)
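
To make the idea concrete, here is a rough sketch of what I have in mind, building on the frame-extraction approach from the cookbook. The model name (`gpt-4o`), the prompts, the 1-frame-per-second sampling rate, the `interview.mp4` filename, and the window/stride values are just placeholders I picked, not anything from the cookbook.

```python
# Rough sketch only; model, prompts, sampling rate and window sizes are assumptions.
import base64
import cv2  # pip install opencv-python
from openai import OpenAI

client = OpenAI()

def extract_frames(video_path, every_n_seconds=1.0):
    """Step 1: sample one frame every `every_n_seconds` and return base64-encoded JPEGs."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30
    step = max(1, int(fps * every_n_seconds))
    frames, i = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if i % step == 0:
            ok, buf = cv2.imencode(".jpg", frame)
            if ok:
                frames.append(base64.b64encode(buf).decode("utf-8"))
        i += 1
    cap.release()
    return frames

def describe_emotion(frame_b64):
    """Step 2: ask the model to describe the apparent emotion in a single frame."""
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumption: any vision-capable model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "In one short sentence, describe the apparent emotional state "
                         "of the person in this interview frame."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{frame_b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content

def summarise_window(descriptions):
    """Step 3: ask the model how emotion changes across one window of per-frame descriptions."""
    numbered = "\n".join(f"{i + 1}. {d}" for i, d in enumerate(descriptions))
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": "These are per-frame emotion descriptions, in chronological order:\n"
                       f"{numbered}\n"
                       "Summarise whether and how the person's emotional state changes "
                       "over this window.",
        }],
    )
    return resp.choices[0].message.content

frames = extract_frames("interview.mp4")          # placeholder filename
per_frame = [describe_emotion(f) for f in frames]
window, stride = 5, 1  # assumption: overlapping windows of 5 frames, sliding one frame at a time
changes = [summarise_window(per_frame[i:i + window])
           for i in range(0, len(per_frame) - window + 1, stride)]
```

With stride 1 the windows overlap, so each frame contributes to several summaries; setting the stride equal to the window size would give non-overlapping chunks and far fewer API calls.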
Would this work? Any other ideas?
Thank you