Geometric Parametrization (GmP) fine-tune of ViT-L/14 on CoCo 40k (RTX 4090, 20 epochs, batch_size=40) outperforms OpenAI ViT-L/14 accuracy on ImageNet and (partially) mitigates the typographic attack vulnerability

First of all, I don’t have “researcher access” to the full ImageNet. By “ImageNet accuracy”, I am referring to a small researcher-curated subset of ImageNet / ObjectNet that can be downloaded here: https://objectnet.dev/mvt/

Geometric Parametrization (GmP) fine-tune vs. OpenAI / ViT-L/14 pre-trained CLIP:

Result:
Original Model Accuracy: 0.8448513339521514
Fine-tuned GmP-CLIP Model Accuracy: 0.8779680809653562

Across multiple runs (shuffle=True), the accuracy only fluctuates at the fourth decimal place (.4f), so the numbers above are representative.
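For readers unfamiliar with how such an accuracy figure is typically obtained, here is a minimal sketch of zero-shot top-1 accuracy, assuming image and class-prompt embeddings have already been computed. The function name `zero_shot_top1` and the toy tensors are made up for illustration; the actual evaluation script is in the GitHub repo and may differ.

```python
import torch
import torch.nn.functional as F


def zero_shot_top1(image_emb: torch.Tensor,
                   class_text_emb: torch.Tensor,
                   labels: torch.Tensor) -> float:
    """Fraction of images whose most similar class prompt is the true class."""
    # Cosine similarity = dot product of L2-normalised embeddings.
    image_emb = F.normalize(image_emb, dim=-1)
    class_text_emb = F.normalize(class_text_emb, dim=-1)
    logits = image_emb @ class_text_emb.T          # (num_images, num_classes)
    preds = logits.argmax(dim=-1)
    return (preds == labels).float().mean().item()


# Toy data (hypothetical): 3 classes, 4 images, 3-dimensional embeddings.
class_text_emb = torch.eye(3)
image_emb = torch.tensor([[2.0, 0.1, 0.0],
                          [0.0, 1.0, 0.2],
                          [0.5, 0.1, 0.4],   # closest to class 0, but labelled 2
                          [0.0, 0.1, 3.0]])
labels = torch.tensor([0, 1, 2, 2])
accuracy = zero_shot_top1(image_emb, class_text_emb, labels)
print(accuracy)  # 0.75
```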

I fine-tuned GmP-CLIP on CoCo-SPRIGHT 40k, with labels capped to 77 tokens (you can find the corresponding .json files in my GitHub repo): https://huggingface.co/datasets/SPRIGHT-T2I/spright_coco

My GitHub repo / for replication: https://github.com/zer0int/CLIP-fine-tune

My other (non-GmP) fine-tunes show the characteristic overfitting to the training dataset that is expected for such a small batch size (and for datasets of 5k-50k samples); even the good models (those that give good guidance as an SDXL text encoder) only reach ~0.5 - 0.7 on this “ImageNet/ObjectNet” dataset. Not a single model I fine-tuned without GmP ever outperformed the original model.

Most notably, there seems to be a partial mitigation of the typographic attack vulnerability. When CLIP predicts its own texts to describe the image (gradient ascent: text embeddings → optimized for cosine similarity with the image embeddings → “CLIP opinion” about salient features), it is more likely NOT to be side-tracked by text in the image, and to classify (predict the word) correctly:

It is case-by-case, though. For OpenAI’s original, truly bad example of an “ipod Apple apple”, performance does not improve, given that CLIP does not distinguish between upper and lower case and has thus learned Apple (the company) as “apple”:

Note that the list of choices includes words predicted by the model itself (via gradient ascent), and that “ioapple” is almost on par with “ipod”. I am not sure whether an “I/O apple” actually counts as “not misclassified”, but it is an interesting observation. For the non-adversarial version, however, GmP-CLIP’s confidence in this being an “apple” increases dramatically vs. the pre-trained OpenAI ViT-L/14; note the improved attention in the heatmap (less attention on irrelevant background).
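The gradient-ascent step mentioned above (optimizing a text-side input for cosine similarity with the image embedding) can be sketched roughly as follows. This is a toy illustration only: the image embedding is a random vector and the "text encoder" is a single frozen linear layer, both stand-ins I am assuming in place of the real CLIP towers. Mapping the optimized input back to actual words is omitted; see the GitHub repo for the real scripts.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
embed_dim = 64

# Stand-in for a frozen CLIP image embedding (assumption: random, no real model).
image_embedding = F.normalize(torch.randn(1, embed_dim), dim=-1)

# Stand-in for a frozen text encoder (assumption: one linear layer).
text_encoder = torch.nn.Linear(32, embed_dim)
for p in text_encoder.parameters():
    p.requires_grad_(False)

# The only trainable tensor: the "soft" text input being optimized.
text_input = torch.randn(1, 32, requires_grad=True)
optimizer = torch.optim.Adam([text_input], lr=0.05)


def similarity() -> torch.Tensor:
    """Cosine similarity between the encoded text input and the image embedding."""
    text_embedding = F.normalize(text_encoder(text_input), dim=-1)
    return (text_embedding * image_embedding).sum()


start = similarity().item()
for _ in range(200):
    optimizer.zero_grad()
    loss = -similarity()  # minimizing the negative is gradient ascent on similarity
    loss.backward()
    optimizer.step()
end = similarity().item()

print(f"cosine similarity: {start:.3f} -> {end:.3f}")
```

In the real setting, the model weights stay frozen in the same way and only the text-side input is updated, so the result reflects what the model itself considers salient in the image.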

GmP-CLIP in a nutshell:

"Normal" CLIP MLP (multi-layer perceptron):

(mlp): Sequential(
  |-(c_fc): Linear(in_features=1024, out_features=4096, bias=True)
  | (gelu): QuickGELU()
|-}-(c_proj): Linear(in_features=4096, out_features=1024, bias=True)
| | 
| |-- visual.transformer.resblocks.0.mlp.c_fc.weight
| |-- visual.transformer.resblocks.0.mlp.c_fc.bias
|
|---- visual.transformer.resblocks.0.mlp.c_proj.weight
|---- visual.transformer.resblocks.0.mlp.c_proj.bias


GmP CLIP MLP:

Weight decomposition into:
- radial component 'r' as norm of pre-trained weights
- angular component 'theta' as normalized direction
-> preserves weight vectors' directionality and magnitude

(mlp): Sequential(
  |-(c_fc): GeometricLinear()
  | (gelu): QuickGELU()
|-}-(c_proj): GeometricLinear()
| | 
| |-- visual.transformer.resblocks.0.mlp.c_fc.r
| |-- visual.transformer.resblocks.0.mlp.c_fc.theta
| |-- visual.transformer.resblocks.0.mlp.c_fc.bias
|
|---- visual.transformer.resblocks.0.mlp.c_proj.r
|---- visual.transformer.resblocks.0.mlp.c_proj.theta
|---- visual.transformer.resblocks.0.mlp.c_proj.bias

(Same thing for [text] transformer.resblocks)
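A minimal sketch of the decomposition described above, assuming one magnitude `r` and one direction `theta` per output row of the weight matrix. The `from_linear` helper and the re-normalization in `forward` are my own illustrative choices; the implementation in the GitHub repo is the authoritative one and may differ in detail.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GeometricLinear(nn.Module):
    """Linear layer whose weight is stored as magnitude r times direction theta."""

    def __init__(self, in_features: int, out_features: int, bias: bool = True):
        super().__init__()
        # One magnitude and one direction vector per output row.
        self.r = nn.Parameter(torch.ones(out_features, 1))
        self.theta = nn.Parameter(torch.randn(out_features, in_features))
        self.bias = nn.Parameter(torch.zeros(out_features)) if bias else None

    @classmethod
    def from_linear(cls, linear: nn.Linear) -> "GeometricLinear":
        """Build a GmP layer that reproduces a pre-trained nn.Linear (hypothetical helper)."""
        layer = cls(linear.in_features, linear.out_features, linear.bias is not None)
        with torch.no_grad():
            w = linear.weight                              # (out_features, in_features)
            norm = w.norm(dim=1, keepdim=True).clamp_min(1e-12)
            layer.r.copy_(norm)                            # radial component
            layer.theta.copy_(w / norm)                    # angular component
            if linear.bias is not None:
                layer.bias.copy_(linear.bias)
        return layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Re-normalize theta so that only its direction matters during training.
        weight = self.r * F.normalize(self.theta, dim=1)
        return F.linear(x, weight, self.bias)


# Sanity check: the converted layer gives the same output as the original one.
torch.manual_seed(0)
linear = nn.Linear(1024, 4096)
gmp = GeometricLinear.from_linear(linear)
x = torch.randn(2, 1024)
max_diff = (linear(x) - gmp(x)).abs().max().item()
print(sorted(name for name, _ in gmp.named_parameters()))  # ['bias', 'r', 'theta']
```

Right after conversion the layer computes the same function as the pre-trained one; what changes is that the optimizer now updates magnitude and direction as separate parameters.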

Please find the full info and code to replicate the results on my GitHub!

Update: Using Long-CLIP [1] (a modification of CLIP ViT-L/14 with expanded embeddings / 248 tokens input) seems to “do the trick” and entirely* eliminates the typographic attack vulnerability after GmP and fine-tuning on CoCo-SPRIGHT-40k with long “spatially right” labels. This was not the case with the 77-token OpenAI pre-trained ViT-L/14, with the GmP-fine-tuned ViT-L/14 (although it showed signs of going in that direction, see above), or with the original Long-CLIP by the authors of [1].

*Entirely: With a dozen or so example images-with-text-in-them that I had at hand.

[1] Long-CLIP: Unlocking the Long-Text Capability of CLIP, arXiv:2403.15378

Code for replication of results: GitHub - zer0int/Long-CLIP: Scripts for use with LongCLIP, including fine-tuning Long-CLIP

OpenAI’s “apple ipod” and “piggy bank poodle” examples with Long-CLIP vs. GmP-Long-CLIP (includes additional choices predicted by Long-CLIP for the image):

This CLIP knows what you did there; to quote CLIP: “absurd apple productfakespeare hoax”. :)