Higher-level visual semantics extraction for cos sim of images?

weird question: I was working on a cos sim recall that weighs timestamps, text, and keywords for my agent…
any way in the future to get a visual semantics score for images sent to the API? idk, something like a score for the meaning of the image, not just basic pixel-level features?

Edit: I’ll try CLIP for now. Combined with scoring the descriptions, timestamps, and keywords, it’ll prob be good enough for my uses. hmm
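
For anyone curious, here's a rough sketch of how a combined score like that could look. It assumes the image and description embeddings are already computed somewhere else (e.g. by a CLIP image/text encoder) and are just passed in as plain lists of floats. The weights, the one-day half-life, the dict keys, and all the function names are made up for illustration, not anything an API provides.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def recency_score(query_ts, memory_ts, half_life_s=86400.0):
    """Exponential decay on the time gap: 1.0 at zero gap, 0.5 one half-life apart.
    The one-day default half-life is an arbitrary assumption."""
    return 0.5 ** (abs(query_ts - memory_ts) / half_life_s)

def keyword_score(query_kw, memory_kw):
    """Jaccard overlap of the two keyword sets; 0.0 if both are empty."""
    q, m = set(query_kw), set(memory_kw)
    if not q and not m:
        return 0.0
    return len(q & m) / len(q | m)

# Hypothetical weights; tune these for your own agent.
WEIGHTS = {"image": 0.4, "text": 0.3, "time": 0.1, "keywords": 0.2}

def recall_score(query, memory, weights=WEIGHTS):
    """Weighted average of the four per-signal scores.
    `query` and `memory` are dicts with (assumed) keys:
    image_emb, text_emb, ts (seconds), keywords."""
    parts = {
        "image": cosine_similarity(query["image_emb"], memory["image_emb"]),
        "text": cosine_similarity(query["text_emb"], memory["text_emb"]),
        "time": recency_score(query["ts"], memory["ts"]),
        "keywords": keyword_score(query["keywords"], memory["keywords"]),
    }
    total = sum(weights.values())
    return sum(weights[k] * parts[k] for k in parts) / total
```

Usage would be something like `recall_score(query, memory)` for each stored memory, then sorting by the result. Note that cosine similarity can go negative, so the combined score isn't strictly in [0, 1] unless you clamp the embedding terms.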