Update: Long-CLIP [1] (a modification of CLIP ViT-L/14 with expanded positional embeddings for 248-token input) seems to “do the trick” and entirely* eliminate the typographic attack vulnerability after applying GmP and fine-tuning on CoCo-SPRIGHT-40k with long, “spatially right” labels. This was not the case for the 77-token OpenAI pre-trained ViT-L/14, for the GmP-fine-tuned ViT-L/14 (although it showed signs of going in that direction, see above), nor for the authors’ original Long-CLIP [1].
*Entirely: as tested on the dozen or so example images containing text that I had at hand.
[1] Long-CLIP: Unlocking the Long-Text Capability of CLIP, arXiv:2403.15378
Code for replication of results: GitHub - zer0int/Long-CLIP (scripts for use with Long-CLIP, including fine-tuning)
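For readers who haven’t followed the earlier posts: GmP (Geometric Parametrization) re-parameterizes the weights of CLIP’s MLP layers into a radial (magnitude) component and an angular (direction) component, which are then fine-tuned instead of the raw weight matrices. Below is only a minimal PyTorch sketch of that general idea; the class name, initialization, and layer choice are illustrative, not the exact implementation in the repo above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GeometricLinear(nn.Module):
    """Illustrative GmP-style linear layer: each output neuron's weight row is
    stored as a magnitude r and a direction theta, with w = r * theta / ||theta||."""

    def __init__(self, in_features: int, out_features: int, bias: bool = True):
        super().__init__()
        w = torch.empty(out_features, in_features)
        nn.init.kaiming_uniform_(w, a=5 ** 0.5)  # same default init as nn.Linear
        norms = w.norm(dim=1, keepdim=True)
        self.r = nn.Parameter(norms)          # radial / magnitude component, [out, 1]
        self.theta = nn.Parameter(w / norms)  # angular / direction component, [out, in]
        self.bias = nn.Parameter(torch.zeros(out_features)) if bias else None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Re-compose the effective weight from magnitude and re-normalized direction.
        weight = self.r * (self.theta / self.theta.norm(dim=1, keepdim=True))
        return F.linear(x, weight, self.bias)
```

Fine-tuning then updates magnitude and direction separately, rather than the plain weight matrix, which is the gist of the GmP runs referenced throughout this post.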
OpenAI’s “apple ipod” and “piggy bank poodle” examples with Long-CLIP vs. GmP-Long-CLIP (includes additional choices predicted by Long-CLIP for the image):
This CLIP knows what you did there; to quote CLIP: “absurd apple productfakespeare hoax”.
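If you want to run this check yourself, zero-shot classification over a handful of candidate labels is all it takes. A minimal sketch follows, assuming the longclip load/tokenize interface from the Long-CLIP repo (it mirrors OpenAI’s clip package); the checkpoint path, image filename, and label list are placeholders to replace with your own.

```python
import torch
from PIL import Image
from model import longclip  # import path as used in the Long-CLIP repo (assumption)

device = "cuda" if torch.cuda.is_available() else "cpu"

# Placeholder checkpoint: original Long-CLIP weights or a GmP-fine-tuned checkpoint.
model, preprocess = longclip.load("checkpoints/longclip-L.pt", device=device)
model.eval()

# The classic typographic attack image: an apple with a paper note reading "iPod".
image = preprocess(Image.open("apple_ipod.png")).unsqueeze(0).to(device)

# Candidate labels; a vulnerable model follows the written text and picks "iPod".
labels = [
    "a photo of an apple",
    "a photo of an iPod",
    "a photo of a piggy bank",
    "a photo of a poodle",
]
text = longclip.tokenize(labels).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Normalize, compute cosine-similarity logits, softmax over the candidate labels.
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```

Running the same labels against the OpenAI 77-token ViT-L/14 (via the clip package) gives the baseline behavior to compare against.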