- Uses CLIP gradient ascent to optimize text embeddings for cosine similarity with the input image embedding → CLIP “opinion” words / text output (see the first sketch after this list)
- GradCAM / attention visualization of salient features for both CLIP ViT and ResNet models (see the Grad-CAM sketch below)
- Gamified: you can guess along with what CLIP was “looking at” by placing an ROI for a given word
- You can edit CLIP’s opinion and add your own (force your human-biased zero-shot choice on the AI ;-))
- Try using ViT-L/14 (the text encoder of Stable Diffusion, SDXL, etc.) or, for a slightly less ideal outcome, even ViT-B/32, and prompt a text-to-image generative AI with the CLIP opinion: “a CLIP knows what a CLIP sees” (if the models match or are very similar). See the Diffusers snippet below.
- This could make a fun network-enabled smartphone “voting” app (“who can guess what the AI was ‘looking’ at?”) with a high-scores table (gamification) for secondary schools, IF ONLY CLIP’s “opinion” didn’t include such inappropriate descriptions of salient features, and in such unpredictable (unexpected) ways. This is more of a “heads up” than an actual implementation suggestion.
Alas, “enjoy responsibly”!
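For concreteness, here is a minimal sketch of the gradient-ascent idea, assuming the openai/CLIP package (`pip install git+https://github.com/openai/CLIP`). It optimizes a free “soft prompt” of token embeddings to maximize cosine similarity with the image embedding, then decodes each position to its nearest vocabulary token. The hyperparameters (`n_tokens`, `steps`, the learning rate) and the bare Adam loop are illustrative assumptions, not this repo’s exact implementation (which typically adds regularizers, batching, etc.):

```python
import torch
import clip
from clip.simple_tokenizer import SimpleTokenizer
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
model = model.float()                          # avoid fp16 gradient headaches
for p in model.parameters():
    p.requires_grad_(False)                    # only the soft prompt is trained

# Target: the normalized embedding of the input image.
image = preprocess(Image.open("input.jpg")).unsqueeze(0).to(device)
with torch.no_grad():
    img_feat = model.encode_image(image)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)

n_tokens, steps = 8, 300                       # illustrative hyperparameters
vocab = model.token_embedding.weight           # [49408, d]
sot, eot = vocab[49406], vocab[49407]          # <|startoftext|>, <|endoftext|>
pad = vocab[0].expand(77 - n_tokens - 2, -1)   # mimic tokenize()'s zero padding

# Free "soft prompt" embeddings, updated by gradient ascent.
soft = (0.02 * torch.randn(n_tokens, vocab.shape[1], device=device)).requires_grad_(True)
opt = torch.optim.Adam([soft], lr=0.1)

def text_features(soft_emb):
    # Re-implements model.encode_text(), but starting from raw embeddings.
    x = torch.cat([sot[None], soft_emb, eot[None], pad], dim=0)[None]  # [1, 77, d]
    x = x + model.positional_embedding
    x = model.transformer(x.permute(1, 0, 2)).permute(1, 0, 2)
    x = model.ln_final(x)
    x = x[:, n_tokens + 1] @ model.text_projection   # read out at the EOT position
    return x / x.norm(dim=-1, keepdim=True)

for _ in range(steps):
    opt.zero_grad()
    loss = -(text_features(soft) * img_feat).sum()   # ascent on cosine similarity
    loss.backward()
    opt.step()

# Decode each optimized embedding to its nearest vocabulary token: the "opinion" words.
with torch.no_grad():
    ids = (soft @ vocab.T).argmax(dim=-1)
print(SimpleTokenizer().decode(ids.tolist()))
```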
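And a hedged sketch of the Grad-CAM side, for the ResNet image encoder (RN50); the ViT path uses attention hooks instead and is not shown here. The target layer (`visual.layer4`) and the helper name `cam_for_word` are my assumptions for illustration:

```python
import torch
import torch.nn.functional as F
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("RN50", device=device)
model = model.float()

# Capture activations and gradients at the last conv stage.
acts, grads = {}, {}
model.visual.layer4.register_forward_hook(
    lambda mod, inp, out: acts.update(feat=out))
model.visual.layer4.register_full_backward_hook(
    lambda mod, gin, gout: grads.update(feat=gout[0]))

def cam_for_word(image_path, word):
    image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
    text = clip.tokenize([word]).to(device)
    sim = F.cosine_similarity(model.encode_image(image), model.encode_text(text)).sum()
    model.zero_grad()
    sim.backward()                                    # d(similarity)/d(activations)
    w = grads["feat"].mean(dim=(2, 3), keepdim=True)  # Grad-CAM channel weights
    cam = F.relu((w * acts["feat"].detach()).sum(dim=1))  # [1, H, W]
    cam = cam / (cam.max() + 1e-8)
    # Upsample to the input resolution for overlaying on the image.
    return F.interpolate(cam[None], size=image.shape[-2:], mode="bilinear",
                         align_corners=False)[0, 0]

heatmap = cam_for_word("input.jpg", "dog")            # where was CLIP "looking"?
```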
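Finally, the round trip: because Stable Diffusion 1.x conditions on CLIP ViT-L/14 text features, a ViT-L/14 “opinion” should steer it recognizably. A minimal sketch with Hugging Face `diffusers`; the model ID and the `clip_opinion` string are placeholders:

```python
import torch
from diffusers import StableDiffusionPipeline

# Placeholder model ID; any Stable Diffusion 1.x checkpoint on the Hub works.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Words recovered by the gradient ascent above (placeholder example).
clip_opinion = "dog grass sunny fetch happy"
image = pipe(clip_opinion).images[0]
image.save("clip_opinion_render.png")
```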