Update: Long-CLIP [1] (a modification of CLIP ViT-L/14 with expanded positional embeddings for 248-token input) seems to “do the trick” and entirely* eliminate the typographic attack vulnerability after applying GmP and fine-tuning on CoCo-SPRIGHT-40k with long, “spatially right” labels. This was not the case for the 77-token OpenAI pre-trained ViT-L/14, for the GmP-fine-tuned ViT-L/14 (although it showed signs of going in that direction, see above), nor for the authors’ original Long-CLIP [1].
*Entirely: as tested on the dozen or so example images containing text that I had at hand.
[1] Long-CLIP: Unlocking the Long-Text Capability of CLIP, arXiv:2403.15378
Code for replication of results: GitHub - zer0int/Long-CLIP (scripts for use with Long-CLIP, including fine-tuning)
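For readers who haven’t followed the earlier posts: GmP (Geometric Parametrization) re-parameterizes the weights of CLIP’s MLP layers into a radial (magnitude) component and an angular (direction) component, which are then fine-tuned instead of the raw weight matrices. Below is only a minimal PyTorch sketch of that general idea; the class name, initialization, and layer choice are illustrative, not the exact implementation in the repo above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class GeometricLinear(nn.Module):
    """Illustrative GmP-style linear layer: each output neuron's weight row is
    stored as a magnitude r and a direction theta, with w = r * theta / ||theta||."""

    def __init__(self, in_features: int, out_features: int, bias: bool = True):
        super().__init__()
        w = torch.empty(out_features, in_features)
        nn.init.kaiming_uniform_(w, a=5 ** 0.5)  # same default init as nn.Linear
        norms = w.norm(dim=1, keepdim=True)
        self.r = nn.Parameter(norms)          # radial / magnitude component, [out, 1]
        self.theta = nn.Parameter(w / norms)  # angular / direction component, [out, in]
        self.bias = nn.Parameter(torch.zeros(out_features)) if bias else None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Re-compose the effective weight from magnitude and re-normalized direction.
        weight = self.r * (self.theta / self.theta.norm(dim=1, keepdim=True))
        return F.linear(x, weight, self.bias)
```

Fine-tuning then updates magnitude and direction separately, rather than the plain weight matrix, which is the gist of the GmP runs referenced throughout this post.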
OpenAI’s “apple ipod” and “piggy bank poodle” examples with Long-CLIP vs. GmP-Long-CLIP (includes additional choices predicted by Long-CLIP for the image):
This CLIP knows what you did there; to quote CLIP: “absurd apple productfakespeare hoax”.
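If you want to run this check yourself, zero-shot classification over a handful of candidate labels is all it takes. A minimal sketch follows, assuming the longclip load/tokenize interface from the Long-CLIP repo (it mirrors OpenAI’s clip package); the checkpoint path, image filename, and label list are placeholders to replace with your own.

```python
import torch
from PIL import Image
from model import longclip  # import path as used in the Long-CLIP repo (assumption)

device = "cuda" if torch.cuda.is_available() else "cpu"

# Placeholder checkpoint: original Long-CLIP weights or a GmP-fine-tuned checkpoint.
model, preprocess = longclip.load("checkpoints/longclip-L.pt", device=device)
model.eval()

# The classic typographic attack image: an apple with a paper note reading "iPod".
image = preprocess(Image.open("apple_ipod.png")).unsqueeze(0).to(device)

# Candidate labels; a vulnerable model follows the written text and picks "iPod".
labels = [
    "a photo of an apple",
    "a photo of an iPod",
    "a photo of a piggy bank",
    "a photo of a poodle",
]
text = longclip.tokenize(labels).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)
    # Normalize, compute cosine-similarity logits, softmax over the candidate labels.
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

for label, p in zip(labels, probs[0].tolist()):
    print(f"{label}: {p:.3f}")
```

Running the same labels against the OpenAI 77-token ViT-L/14 (via the clip package) gives the baseline behavior to compare against.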