First of all, I don’t have “researcher access” to the full ImageNet. By “ImageNet accuracy”, I am referring to a small researcher-curated subset of ImageNet / ObjectNet that can be downloaded here: https://objectnet.dev/mvt/
Geometric Parametrization (GmP) fine-tune vs. OpenAI / ViT-L/14 pre-trained CLIP:
Result:
Original Model Accuracy: 0.8448513339521514
Fine-tuned GmP-CLIP Model Accuracy: 0.8779680809653562
Across multiple runs (shuffle=True), the results only fluctuate in the fourth decimal place, so the numbers above are representative.
I fine-tuned GmP-CLIP on COCO-SPRIGHT 40k (labels capped to CLIP's 77-token limit; you can find those .json files in my GitHub repo): https://huggingface.co/datasets/SPRIGHT-T2I/spright_coco
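For reference, here is a minimal sketch of the kind of zero-shot accuracy comparison involved, using the openai/CLIP package. The `samples` list (image paths and ground-truth class names from the MVT download) and the fine-tuned checkpoint path are placeholders, not my exact evaluation script (that is in the repo):

```python
# Minimal sketch of a zero-shot accuracy comparison with openai/CLIP
# (https://github.com/openai/CLIP). `samples` and `class_names` must be
# prepared from the MVT download; the checkpoint path is a placeholder and
# assumes the GmP weights were converted back to the standard .weight format.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"

def zero_shot_accuracy(model, preprocess, samples, class_names):
    # One text embedding per candidate class name.
    tokens = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)
    with torch.no_grad():
        text_feat = model.encode_text(tokens)
        text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)

    correct = 0
    for image_path, label in samples:  # samples: list of (path, class_name)
        image = preprocess(Image.open(image_path)).unsqueeze(0).to(device)
        with torch.no_grad():
            img_feat = model.encode_image(image)
            img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        pred = (img_feat @ text_feat.T).argmax(dim=-1).item()
        correct += int(class_names[pred] == label)
    return correct / len(samples)

model_orig, preprocess = clip.load("ViT-L/14", device=device)
model_ft, _ = clip.load("path/to/GmP-finetune-converted.pt", device=device)  # placeholder path

# print(zero_shot_accuracy(model_orig, preprocess, samples, class_names))
# print(zero_shot_accuracy(model_ft, preprocess, samples, class_names))
```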
My GitHub repo / for replication: https://github.com/zer0int/CLIP-fine-tune
My other (non-GmP) fine-tunes show the characteristic overfitting to the training dataset you would expect at such a small batch size (and with 5k-50k datasets); models that are otherwise good (e.g. provide good guidance as an SDXL text encoder) only reach ~0.5 - 0.7 on this "ImageNet/ObjectNet" dataset. Not a single model I fine-tuned without GmP ever outperformed the original model.
Most notably, there seems to be partial mitigation of the typographic attack vulnerability: when CLIP predicts its own texts to describe the image (gradient ascent: optimize text embeddings for cosine similarity with the image embeddings, yielding a "CLIP opinion" about the salient features), it is more likely NOT to be side-tracked by text in the image, and classifies (predicts the word) correctly:
It is case-by-case, though. For OpenAI's original, truly bad example of an "ipod Apple apple" (and given that CLIP does not distinguish between upper and lower case and has learned Apple, the company, as "apple"), performance does not improve:
Note that the list of choices includes words the model itself predicted (via gradient ascent), and that "ioapple" is almost on par with "ipod". I am not sure whether an "I/O apple" actually counts as "not misclassified", but it is an interesting note. However, for the non-adversarial version, GmP-CLIP's confidence in this being an "apple" increases dramatically vs. the pre-trained OpenAI ViT-L/14; note the improved attention in the heatmap (less irrelevant background attention).
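For illustration, a minimal sketch of this kind of single-image probe with the openai/CLIP package: score one image against a small list of candidate labels, including gradient-ascent-derived words such as "ioapple". The image path and the label list are illustrative placeholders, not my exact test setup:

```python
# Score one image against a handful of candidate labels (zero-shot probe).
# Image path and labels are placeholders for the adversarial "apple + ipod
# note" example discussed above.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)

labels = ["apple", "ipod", "ioapple", "granny smith", "library"]
image = preprocess(Image.open("apple_with_ipod_note.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(labels).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1).squeeze(0)

for label, p in zip(labels, probs.tolist()):
    print(f"{label:>12}: {p:.4f}")
```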
GmP-CLIP in a nutshell:
"Normal" CLIP MLP (multi-layer perceptron):
(mlp): Sequential(
|-(c_fc): Linear(in_features=1024, out_features=4096, bias=True)
| (gelu): QuickGELU()
|-}-(c_proj): Linear(in_features=4096, out_features=1024, bias=True)
| |
| |-- visual.transformer.resblocks.0.mlp.c_fc.weight
| |-- visual.transformer.resblocks.0.mlp.c_fc.bias
|
|---- visual.transformer.resblocks.0.mlp.c_proj.weight
|---- visual.transformer.resblocks.0.mlp.c_proj.bias
GmP CLIP MLP:
Weight decomposition into:
- radial component 'r' as norm of pre-trained weights
- angular component 'theta' as normalized direction
-> preserves weight vectors' directionality and magnitude
(mlp): Sequential(
|-(c_fc): GeometricLinear()
| (gelu): QuickGELU()
|-}-(c_proj): GeometricLinear()
| |
| |-- visual.transformer.resblocks.0.mlp.c_fc.r
| |-- visual.transformer.resblocks.0.mlp.c_fc.theta
| |-- visual.transformer.resblocks.0.mlp.c_fc.bias
|
|---- visual.transformer.resblocks.0.mlp.c_proj.r
|---- visual.transformer.resblocks.0.mlp.c_proj.theta
|---- visual.transformer.resblocks.0.mlp.c_proj.bias
(Same thing for [text] transformer.resblocks)
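To make the decomposition concrete, here is a minimal, illustrative sketch of a GeometricLinear-style module (not the exact code from the repo; it assumes per-output-row norms, as suggested by the r/theta parameter names above):

```python
# Sketch of the GmP idea: decompose a pre-trained Linear weight into a
# radial component 'r' (per-row norm) and an angular component 'theta'
# (direction), and reconstruct the effective weight on every forward pass.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeometricLinear(nn.Module):
    def __init__(self, linear: nn.Linear):
        super().__init__()
        w = linear.weight.data                                # [out_features, in_features]
        # radial component: per-row norm of the pre-trained weight matrix
        self.r = nn.Parameter(w.norm(dim=1, keepdim=True))    # [out_features, 1]
        # angular component: the (to-be-normalized) direction of each row
        self.theta = nn.Parameter(w.clone())                  # [out_features, in_features]
        self.bias = (nn.Parameter(linear.bias.data.clone())
                     if linear.bias is not None else None)

    def forward(self, x):
        # effective weight = magnitude * unit direction
        weight = self.r * F.normalize(self.theta, dim=1)
        return F.linear(x, weight, self.bias)

# Example: swap the MLP projections of one resblock for their GmP versions
# (attribute paths follow the parameter names shown above).
# block = model.visual.transformer.resblocks[0]
# block.mlp.c_fc = GeometricLinear(block.mlp.c_fc)
# block.mlp.c_proj = GeometricLinear(block.mlp.c_proj)
```

At initialization, r * theta/||theta|| reproduces the pre-trained weight exactly, which is what "preserves weight vectors' directionality and magnitude" refers to; r and theta are then trained as separate parameters.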
Please find the full info and code to replicate the results on my GitHub!