Geometric Parametrization (GmP) fine-tune of CLIP ViT-L/14 on COCO 40k (RTX 4090, 20 epochs, batch_size=40) outperforms the OpenAI ViT-L/14 in ImageNet accuracy and (partially) mitigates the typographic attack vulnerability

Update: Using Long-CLIP [1] (a modification of CLIP ViT-L/14 with expanded embeddings / 248-token input) seems to “do the trick” and entirely* eliminates the typographic attack vulnerability after GmP fine-tuning on COCO-SPRIGHT-40k with long, “spatially right” labels. This was not the case with the 77-token OpenAI pre-trained ViT-L/14, nor with my GmP-fine-tuned ViT-L/14 (although it showed signs of going in that direction, see above), nor with the original Long-CLIP by the authors of [1].

*Entirely: With a dozen or so example images-with-text-in-them that I had at hand.
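To illustrate why the expanded input matters, here is a minimal sketch comparing the two tokenizers. It assumes OpenAI’s `clip` package and the `model.longclip` module layout from the repository linked below; the caption text is an illustrative placeholder, not from the dataset:

```python
import clip                  # OpenAI CLIP: fixed 77-token context
from model import longclip  # Long-CLIP: expanded 248-token context

# A long, "spatially right" caption in the style of SPRIGHT annotations.
caption = (
    "A red apple on a wooden table, with a small handwritten note reading "
    "'iPod' taped to its left side, a ceramic piggy bank behind it, and a "
    "window with white curtains visible in the background on the right."
)

tokens_77 = clip.tokenize([caption], truncate=True)  # anything past 77 tokens is cut off
tokens_248 = longclip.tokenize([caption])            # padded out to 248 tokens

print(tokens_77.shape, tokens_248.shape)  # torch.Size([1, 77]) torch.Size([1, 248])
```

With only 77 tokens, detailed spatial captions like those in COCO-SPRIGHT get truncated; the 248-token context lets the model actually train on the full description.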

[1] Long-CLIP: Unlocking the Long-Text Capability of CLIP, arXiv:2403.15378

Code for replicating the results: https://github.com/zer0int/Long-CLIP (scripts for use with Long-CLIP, including fine-tuning)
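For context, as I understand the GmP approach in that code, it swaps the `Linear` layers in CLIP’s MLP blocks (`c_fc`, `c_proj`) for a parametrization that stores each weight matrix as a per-neuron magnitude `r` and direction `theta`, recomposing `W = r * theta / ||theta||` on the forward pass. A minimal sketch of the idea (my paraphrase, not the repo’s exact code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeometricLinear(nn.Module):
    """Linear layer with weight stored as magnitude (r) and direction (theta).

    Effective weight: W = r * theta / ||theta||, row-wise, so the optimizer
    updates each neuron's norm and direction as separate parameters.
    """
    def __init__(self, in_features: int, out_features: int, bias: bool = True):
        super().__init__()
        w = torch.empty(out_features, in_features)
        nn.init.kaiming_uniform_(w, a=5 ** 0.5)
        norms = w.norm(dim=1, keepdim=True)
        self.r = nn.Parameter(norms)           # (out_features, 1) magnitudes
        self.theta = nn.Parameter(w / norms)   # (out_features, in_features) directions
        self.bias = nn.Parameter(torch.zeros(out_features)) if bias else None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Recompose the weight each step; normalizing theta means r alone carries scale.
        weight = self.r * F.normalize(self.theta, dim=1)
        return F.linear(x, weight, self.bias)
```

After fine-tuning, the decomposition can be folded back into a plain weight matrix (`W = r * theta / ||theta||`) so the checkpoint loads like a standard CLIP model.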

OpenAI’s “apple ipod” and “piggy bank poodle” examples with Long-CLIP vs. GmP-Long-CLIP (includes additional choices predicted by Long-CLIP for the image):

This CLIP knows what you did there with your, quote, CLIP: “absurd apple productfakespeare hoax”. :slight_smile:
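For anyone who wants to run this kind of check themselves, here is a minimal zero-shot sketch of the “apple vs. iPod” test, assuming the `longclip.load` / `longclip.tokenize` API from the repo above; the checkpoint name, image path, and labels are placeholders:

```python
import torch
from PIL import Image
from model import longclip  # module layout of the Long-CLIP repo

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = longclip.load("checkpoints/longclip-L.pt", device=device)

labels = ["a photo of an apple", "a photo of an iPod"]
text = longclip.tokenize(labels).to(device)
image = preprocess(Image.open("apple_with_ipod_note.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    img = model.encode_image(image)
    txt = model.encode_text(text)
    img = img / img.norm(dim=-1, keepdim=True)
    txt = txt / txt.norm(dim=-1, keepdim=True)
    probs = (100.0 * img @ txt.T).softmax(dim=-1).squeeze(0)

for label, p in zip(labels, probs.tolist()):
    print(f"{label}: {p:.3f}")
# A typographically robust model should still favor "apple" despite the written text.
```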
