It seems like we’re very close to an AI with shared comprehension across text, visuals (video and images), audio, and any number of other senses. And a super AI that integrates multiple senses seems far more powerful than one limited to a single modality.
This will be a critical step towards one type of AGI.
Yes, the more multimodal the better. When text, images, video, and sound are all used to train a single model, it will be something amazing.