CLIP GUI - XAI app ~ explainable (and guessable) AI with CLIP ViT & ResNet models

  • Uses CLIP gradient ascent to optimize text embeddings for cosine similarity with the input image embeddings → CLIP “opinion” words / text output
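For reference, the core of that gradient-ascent objective can be sketched in a few lines. This is a minimal stand-in, not the app's actual code: it uses random unit vectors in place of real CLIP encoder outputs and ascends cosine similarity directly on the embedding (the real pipeline backpropagates through CLIP's text encoder and maps the optimized embedding back to tokens/words):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 64  # toy dimension; CLIP ViT-B/32 uses 512, ViT-L/14 uses 768

# Stand-in for a CLIP image embedding (CLIP L2-normalizes its embeddings)
img_emb = rng.normal(size=dim)
img_emb /= np.linalg.norm(img_emb)

# Trainable "text" embedding, randomly initialized on the unit sphere
txt_emb = rng.normal(size=dim)
txt_emb /= np.linalg.norm(txt_emb)

lr = 0.1
for _ in range(300):
    # For unit-norm txt_emb, the cosine-similarity gradient is the
    # component of img_emb orthogonal to txt_emb
    grad = img_emb - (txt_emb @ img_emb) * txt_emb
    txt_emb += lr * grad                      # gradient *ascent* step
    txt_emb /= np.linalg.norm(txt_emb)        # re-project onto unit sphere

sim = float(txt_emb @ img_emb)
print(round(sim, 3))  # converges toward 1.0
```

In the real app the optimized embedding stays tied to CLIP's token space, which is what makes readable "opinion" words fall out instead of an arbitrary vector.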

  • GradCAM / attention visualization of salient features for both CLIP ViT and ResNet models
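The GradCAM part of that visualization reduces to a weighted sum of feature maps. A sketch with made-up tensors (in real use, `acts` and `grads` come from a hooked conv layer of the ResNet, or from attention/value maps for the ViT, with the gradient taken w.r.t. the image-text similarity score):

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-ins: activations A_k of the target layer, and the gradients
# d(similarity score)/dA_k obtained via backprop
acts  = rng.normal(size=(512, 7, 7))
grads = rng.normal(size=(512, 7, 7))

# alpha_k: global-average-pool the gradients per channel
weights = grads.mean(axis=(1, 2))

# ReLU of the channel-weighted sum of activations
cam = np.maximum((weights[:, None, None] * acts).sum(axis=0), 0)

# Normalize to [0, 1] for display; the GUI then upsamples the 7x7
# heatmap to the input image resolution
cam /= cam.max() + 1e-8
print(cam.shape)  # (7, 7)
```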

  • Gamified: You can guess what CLIP was “looking at” by placing an ROI (region of interest) for a given word


  • You can edit CLIP’s opinion or add your own (forcing your human-biased zero-shot choice on the AI ;-))
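That zero-shot “forced choice” is just CLIP picking whichever candidate prompt’s embedding is most similar to the image embedding. A sketch with random unit vectors in place of real encoder outputs (prompt strings and the `logit_scale` value are illustrative; CLIP’s learned temperature lands near 100 at convergence):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(2)
dim = 64

# Stand-in image embedding, L2-normalized like CLIP's
img = rng.normal(size=dim)
img /= np.linalg.norm(img)

# Hypothetical candidate prompts; the user can append their own label here
prompts = ["a photo of a cat", "a photo of a dog", "my own forced label"]
txt = rng.normal(size=(len(prompts), dim))
txt /= np.linalg.norm(txt, axis=1, keepdims=True)

logit_scale = 100.0  # temperature-like scaling of cosine similarities
probs = softmax(logit_scale * (txt @ img))
best = prompts[int(np.argmax(probs))]
```

Adding your own prompt to `prompts` is exactly the “force your bias on the AI” move: if its embedding happens to sit closest to the image, CLIP dutifully agrees with you.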

  • Try ViT-L/14 (the text encoder used by Stable Diffusion, SDXL, etc.), or, with a slightly less ideal outcome, even ViT-B/32, and prompt a text-to-image generative AI with the CLIP opinion. “A CLIP knows what a CLIP sees” (if the models match or are very similar).

  • This could make a fun network-enabled smartphone app: a “voting system” for “who can guess correctly what the AI was ‘looking’ at?”, plus a high-scores table (gamification) for secondary school. IF ONLY! If only CLIP’s “opinion” didn’t include such inappropriate descriptions of salient features, and in such unpredictable (unexpected) ways. Alas, this is more of a “heads up” than an actual implementation suggestion.

So: “enjoy responsibly”!
