GPT4 Multi-Modality (audio & video)

There is a lot of hype about GPT4's multi-modal capabilities, but it seems they are/will be limited to images only (i.e. to content within an image).
My question is: is there a plan to make GPT4 truly multi-modal (image + audio + video)? An example of expected functionality would be:

and then elicit answers based on the information content of the video/audio.

e.g.: What is Sam saying about GPT4 in the video X?
Who’s interviewing Sam?
What is being discussed between min 4 and min 6 of the recording?

Thank you!