Multimodal Extractor: Is it currently possible?

Hello everyone,

This is my first topic, and I apologize for my ignorance.

I used OpenAI’s APIs some time ago, but I haven’t kept up with them recently, so I’m not aware of their current capabilities. Please forgive me if something similar already exists, but I’m curious to know whether there are tools that integrate multimodal extraction, meaning they can work with text, images, audio, and video in a unified way.

If this doesn’t exist yet, I think it would be incredible to develop something like this, especially with an API that allows it to be implemented in different projects. I believe the applications could be vast, ranging from media analysis to scientific research or creative processes.

What do you think? Does something like this exist, or do you think this is something that could be explored in the future? I’d love to hear your ideas and comments.


You can already interleave all of the modalities you mention. The vision guide covers image input: a sequence of frames sampled periodically from a video can be sent as part of a single user message, which gives you a form of video understanding. The audio guide covers “audio input to model,” which lets you ask questions about a recording or, more commonly, treat the audio itself as the user’s input. Both build on the standard text generation capabilities, so ordinary conversation via a user message prompt works alongside them.
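To make that concrete, here is a minimal sketch in Python of both patterns using the official `openai` package. The frame paths, the audio file, and the model names (`gpt-4o`, `gpt-4o-audio-preview`) are illustrative assumptions; check the current model list before relying on them, since availability changes.

```python
# Minimal sketch of interleaving modalities in chat requests.
# Assumptions: the `openai` Python package is installed, OPENAI_API_KEY
# is set in the environment, and the model names below are still valid.
import base64

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def b64(path: str) -> str:
    """Read a local file and return its contents base64-encoded as text."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")


# "Video" understanding: several frames sampled from a clip, sent as
# ordinary image parts inside a single user message.
frame_parts = [
    {
        "type": "image_url",
        "image_url": {"url": f"data:image/jpeg;base64,{b64(p)}"},
    }
    for p in ["frame_000.jpg", "frame_030.jpg", "frame_060.jpg"]  # hypothetical paths
]

video_response = client.chat.completions.create(
    model="gpt-4o",  # a vision-capable model (assumed name)
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "These frames were sampled one second apart. "
                            "Describe what happens in the clip.",
                },
                *frame_parts,
            ],
        }
    ],
)
print(video_response.choices[0].message.content)

# Audio input uses the same message shape but an "input_audio" part
# and an audio-capable model.
audio_response = client.chat.completions.create(
    model="gpt-4o-audio-preview",  # audio-capable model (assumed name)
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Summarize this recording."},
                {
                    "type": "input_audio",
                    "input_audio": {"data": b64("clip.wav"), "format": "wav"},
                },
            ],
        }
    ],
)
print(audio_response.choices[0].message.content)
```

Note that the two requests target different models: image input and audio input are supported by different model variants at the moment, so a single “do everything” call may not be possible yet, and you may need to route each modality to the model that handles it.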
