Feeding multiple videos into GPT-4o

Hi,

I want to feed 2-4 videos into the same prompt / request to GPT-4o. The videos are short, 3-10 seconds each, with 5-20 frames per video, so they fit in the context window.
My questions are:

  1. Can I feed multiple videos in such a way that the model understands they are different? (One option is to overlay text on the frames and insert skip frames between videos, sending everything as one contiguous sequence of frames, but that’s a little hacky.)
  2. How can I align the transcription text with each video, by frame or by second?

There is an example in the cookbook named “introduction_to_gpt4o”, but it takes just a single video as a sequence of frames, and I want to feed multiple.
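
To make the question concrete, here is a rough sketch of the kind of request content I have in mind. It interleaves a text label before each frame so that every image is tagged with its video, frame index, and timestamp, plus any transcript text spoken at that moment. The function name `build_content`, the `fps` parameter, the label wording, and the `(start_s, end_s, text)` transcript tuples are my own assumptions for illustration. Only the `text` / `image_url` part shapes follow the Chat Completions image input format. I have not verified that this is enough for the model to keep the videos apart.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical transcript segment: (start_seconds, end_seconds, spoken_text)
Segment = Tuple[float, float, str]


def build_content(
    videos: Dict[str, List[str]],
    fps: float,
    transcripts: Optional[Dict[str, List[Segment]]] = None,
) -> List[dict]:
    """Build one user-message content list covering several videos.

    videos maps a video name to its frames as base64-encoded JPEG strings.
    fps is the rate at which frames were sampled, used to derive timestamps.
    """
    content: List[dict] = []
    for idx, (name, frames) in enumerate(videos.items(), start=1):
        # A header marks where each video starts.
        content.append({
            "type": "text",
            "text": f"Video {idx} ({name}): {len(frames)} frames sampled at {fps} fps.",
        })
        segments = (transcripts or {}).get(name, [])
        for i, frame_b64 in enumerate(frames):
            t = i / fps
            # Collect transcript text whose time range covers this frame.
            spoken = [text for start, end, text in segments if start <= t < end]
            label = f"Video {idx}, frame {i}, t={t:.2f}s"
            if spoken:
                joined = " ".join(spoken)
                label += f" | transcript: {joined}"
            content.append({"type": "text", "text": label})
            content.append({
                "type": "image_url",
                "image_url": {"url": f"data:image/jpeg;base64,{frame_b64}"},
            })
    return content
```

The returned list would go in as the `content` of a single user message, followed by the actual question about the videos. Labelling every frame costs extra tokens, so one label per video or per second may be a reasonable trade-off.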