Using GPT (via the API) to analyse changes in facial emotion in a video

I read this cookbook, which explains that you can do video analysis by extracting the video's frames and then applying image analysis to each frame.

While this allows me to do video-to-text analysis, I wonder how you can tell whether and how things have evolved or changed over the course of the video.

For example, I want to analyse a video recording of an interview and understand how the person's emotional state changes over time. How would you approach this?

My thought:

  • chunk the video into frames
  • ask GPT to analyse the emotion in each frame
  • place the per-frame outputs from step 2 into sliding windows, say 5 frames at a time, then ask GPT how the emotion has changed based on the descriptions it generated (see the sketch after this list)
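
To make the idea concrete, here is a minimal sketch of that three-step pipeline using OpenCV and the OpenAI Python SDK. The model name (`gpt-4o`), the file path `interview.mp4`, the sampling rate, and the helper names are all illustrative assumptions, not anything from the cookbook, and the prompts would need tuning for real use:

```python
import base64
import cv2  # pip install opencv-python
from openai import OpenAI  # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def extract_frames(video_path, every_n_seconds=2):
    """Step 1: sample one frame every `every_n_seconds` as base64-encoded JPEGs."""
    video = cv2.VideoCapture(video_path)
    fps = video.get(cv2.CAP_PROP_FPS)
    step = max(int(fps * every_n_seconds), 1)
    frames, index = [], 0
    while True:
        ok, frame = video.read()
        if not ok:
            break
        if index % step == 0:
            _, buffer = cv2.imencode(".jpg", frame)
            frames.append(base64.b64encode(buffer).decode("utf-8"))
        index += 1
    video.release()
    return frames

def describe_emotion(frame_b64):
    """Step 2: ask a vision-capable model to label the emotion in a single frame."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe the facial emotion of the person in this frame "
                         "in one short sentence."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{frame_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

def summarise_change(window):
    """Step 3: given consecutive per-frame descriptions, ask how the emotion evolved."""
    numbered = "\n".join(f"Frame {i + 1}: {d}" for i, d in enumerate(window))
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": ("These are emotion descriptions of consecutive frames "
                        "from an interview:\n"
                        f"{numbered}\n\n"
                        "How does the person's emotional state change across these frames?"),
        }],
    )
    return response.choices[0].message.content

frames = extract_frames("interview.mp4")  # hypothetical input file
descriptions = [describe_emotion(f) for f in frames]
for start in range(0, len(descriptions) - 4):  # overlapping windows of 5
    print(summarise_change(descriptions[start:start + 5]))
```

One caveat with the sliding-window step: each description gets re-sent in up to five windows, which multiplies token cost. If the per-frame outputs are short, it might be cheaper to send the whole sequence in a single prompt and ask for a timeline of emotional changes instead.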

Would this work? Any other ideas?

Thank you