Makeshift CLIP vision for GPT-4, image-to-language > GPT-4 prompting Shap-E vs. Shap-E image-to-3D

I was an early adopter of CLIP back in 2021 - I probably spent hundreds of hours “getting a CLIP opinion about images” (gradient ascent / feature activation maximization, returning the words / tokens for what CLIP ‘sees’ in an image).
For context (in case spending hundreds of hours playing with CLIP “looking at images” sounds crazy), during that time, pretty much “solitary confinement” / lockdown rules still applied where I live. :wink:

For one, CLIP tokens are hilarious - they are beyond what I can think up behind my brain’s rigid filters, but still sufficiently related to how I interpret an image that it “makes some sense”; i.e., CLIP is - in my opinion - on par with the best human satirical comedians due to the “unexpected plot twists” the AI drops continuously.
For example, a CLIP ResNet looking at its own “neuron” (OpenAI microscope feature activation image): “artificialintelligence trippyeyes library OMFG [shocked-face-cat-emoji] instantsurrealism”. :rofl:

But CLIP “opinions” are also useful for prompting the CLIP inside any generative text-to-image AI; real example: “hallucinkaleidodimensional” is a perfect CLIP “opinion-token” weird-longword to use for making “mandala-like, psychedelic, fractal, abstract images”.

I found that out via trial-and-error of “letting CLIP ‘look’ at its own neurons and many other images, and using the CLIP ‘opinion’ tokens to prompt a [CLIP+VQGAN, guided diffusion unconditional CLIP]”.

That is why I wanted to see what happens when I confront GPT (initially GPT-3.5 via API, using AutoGPT) with that “raw” opinion of a good old pre-trained CLIP in gradient ascent (rather than just forcing CLIP into “rather normal / boring human language” via zero-shot classification, e.g. via “CLIP interrogator” [which is an excellent and valuable tool, but not what I was after in this case]).

I found that GPT-* is very sophisticated in choosing the proper, albeit very weird, CLIP tokens to make a good prompt for a text-to-image model [in my case, stable diffusion running locally, because an “unhinged” CLIP can lead to offensive AI-generated prompts / images], i.e. a prompt that unexpectedly results in a good, coherent image.

Again and again, I found myself surprised about a coherent (on-topic), good quality image resulting from very strange prompts that I thought were gonna end badly, wondering “just why did GPT choose THOSE tokens, out of all available?”.

Initially, I did not yet have GPT-4 API access. And I found that when instructed not just with the goal “create an image” but with “create an image of sneakers”, GPT-3.5 would conclude that “the CLIP opinion is not very good” and proceed to run off to Pinterest to “get a better opinion for [its] prompt”, or - case: no internet - it would try to “get a CLIP opinion about a different image” because “the CLIP opinion is not very useful”.

Thus, I leveraged CLIP’s typographic attack vulnerability - including text that says “SNEAKERS” in my image depicting sneakers, for example - and left GPT-3.5 with the rather arbitrary “create an image” goal. Despite many “weird” tokens of all the things CLIP ‘sees’, this was able to steer GPT-3.5 nicely (and prevented it from running off to the internet and shunning “the CLIP opinion”).

[Unfortunately, I can still only post one “media item” as a “new user”; image “init image with text” removed]

I always knew that was a FEATURE of CLIP, and not just a BUG or a problem. :slight_smile:

[gradient ascent example image with tokens and loss removed]

For example, CLIP (3D render of sneakers, with text “SNEAKERS” and “FOOTWEAR” and “CYBERWEAR” in it): effective footwear cyberdesigner simulation cyberfanart sneakers manufacturers tware scycybersubjects designing designs inventor ware sneaker software gameweaver

AutoGPT goal (mine): “Use 5 suitable CLIP opinion tokens from the txt file, including the tokens that may seem like nonsense, and combine them to create a coherent and interesting prompt for image generation. Ensure the prompt makes sense and is related to the image generation task. Be creative! Example prompt: A in a , inspired by , professional photography, highly detailed”

GPT: “A cyberdesigner sneaker with effective cyberfanart, inspired by simulation software, and created by top manufacturers”
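The exchange above can be checked mechanically. Here is a minimal sketch, assuming the CLIP opinion is a whitespace-separated line of tokens; `parse_opinion`, `build_task` and `tokens_used` are hypothetical helper names for illustration, not part of AutoGPT or CLIP. It appends the tokens to a goal string and reports which opinion tokens actually show up as whole words in the prompt that comes back.

```python
import re

def parse_opinion(text):
    """Split a CLIP 'opinion' line into unique lowercase tokens, keeping order."""
    seen = []
    for tok in text.lower().split():
        if tok not in seen:
            seen.append(tok)
    return seen

def build_task(goal, tokens):
    """Append the opinion tokens to a goal string (hypothetical message layout)."""
    return goal + "\n\nCLIP opinion tokens: " + " ".join(tokens)

def tokens_used(prompt, tokens):
    """Return the opinion tokens that appear as whole words in the prompt."""
    words = set(re.findall(r"[a-z0-9]+", prompt.lower()))
    return [t for t in tokens if t in words]

# Opinion and prompt copied from the example above.
opinion = ("effective footwear cyberdesigner simulation cyberfanart sneakers "
           "manufacturers tware scycybersubjects designing designs inventor "
           "ware sneaker software gameweaver")
prompt = ("A cyberdesigner sneaker with effective cyberfanart, inspired by "
          "simulation software, and created by top manufacturers")

tokens = parse_opinion(opinion)
print(tokens_used(prompt, tokens))
```

Whole-word matching is a deliberate simplification here: “sneaker” counts, but “sneakers” does not, because the prompt only contains the singular.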

Now, I am super grateful for GPT-4 API access, as GPT-4 can be instructed to generate images of (e.g. sneakers) and STILL adhere to the “CLIP opinion”, without running off to the internet because the CLIP tokens are not palatable:

CLIP (about previous stable diffusion image): polaroid optical weekday vans mods color ■ supersonic browse sb vimeo acapcolourful hornets saturated colorful rainbow mss product sneakers

GPT-4: “A polaroid optical sneaker in a weekday vans mods color, inspired by supersonic browse, highly detailed”

It sounds like a very odd prompt, but the results are great (stable diffusion 2.1, V-model / 768px)!
This was done with an init image of sneakers → get a CLIP opinion → make a prompt for stable diffusion → obtain a CLIP opinion about the image generated in the previous step → repeat. One continuous and flawless loop with GPT-4 (total: 42 iterations; then I stopped the AI, because $10 is enough to prove a point: The AI is AI-mazing!). GPT-3.5 would get confused (catastrophic forgetting of what the AI is even doing) and act erratically after 6-10 iterations at best.
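The loop described here can be sketched in a few lines. This is only an outline under stated assumptions: the three callables (`clip_opinion`, `make_prompt`, `generate_image`) are hypothetical stand-ins for the gradient-ascent CLIP script, the GPT call and stable diffusion, and the demo below replaces them with trivial string stubs so that the control flow can run without any model.

```python
from typing import Callable, List, Tuple

def feedback_loop(
    init_image: str,
    clip_opinion: Callable[[str], List[str]],
    make_prompt: Callable[[List[str]], str],
    generate_image: Callable[[str], str],
    iterations: int,
) -> Tuple[str, List[Tuple[List[str], str]]]:
    """Run image -> CLIP opinion -> prompt -> new image, `iterations` times."""
    history = []
    image = init_image
    for _ in range(iterations):
        tokens = clip_opinion(image)    # what CLIP 'sees' in the current image
        prompt = make_prompt(tokens)    # the LLM turns tokens into a prompt
        image = generate_image(prompt)  # text-to-image produces the next image
        history.append((tokens, prompt))
    return image, history

# Stubs for illustration only: "images" are plain strings here.
def fake_clip(image: str) -> List[str]:
    return image.split()[:3]

def fake_llm(tokens: List[str]) -> str:
    return "A " + " ".join(tokens)

def fake_sd(prompt: str) -> str:
    return "image of " + prompt

final_image, history = feedback_loop(
    "sneakers product photo", fake_clip, fake_llm, fake_sd, iterations=3
)
for tokens, prompt in history:
    print(tokens, "->", prompt)
```

Passing the models in as callables keeps the loop itself independent of which CLIP, LLM or image generator sits behind it, and makes it easy to cap the number of iterations (and the API bill).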

What I found very curious is that a new instance of ChatGPT-4 instructed in the same way and presented with the same CLIP “opinion” tokens will have a strong ‘preference’ for some (not all, but some very strongly) CLIP tokens.

In the above example, a “fresh” ChatGPT-4 instance will choose “optical polaroid” or “polaroid optical” + “weekday” to make a prompt (albeit the position of the words in the prompt varies). Must be really outstanding logprobs for these terms!
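One way to put a number on this “preference” is to collect the prompts from several fresh instances and count how often each CLIP token gets picked. A minimal sketch, assuming whole-word matching; `token_pick_rates` is a hypothetical helper, and apart from the first prompt (quoted from above) the sample prompts are invented placeholders, not real GPT-4 output.

```python
import re
from collections import Counter

def token_pick_rates(opinion_tokens, prompts):
    """Fraction of prompts in which each opinion token appears as a whole word."""
    if not prompts:
        raise ValueError("need at least one prompt")
    counts = Counter()
    for prompt in prompts:
        words = set(re.findall(r"[a-z0-9]+", prompt.lower()))
        counts.update(t for t in set(opinion_tokens) if t in words)
    return {t: counts[t] / len(prompts) for t in opinion_tokens}

# A few tokens from the CLIP opinion above.
opinion_tokens = ["polaroid", "optical", "weekday", "vans", "hornets"]

prompts = [
    # Real prompt from the GPT-4 run quoted above:
    "A polaroid optical sneaker in a weekday vans mods color, "
    "inspired by supersonic browse, highly detailed",
    # Invented placeholders standing in for other fresh instances:
    "An optical polaroid sneaker on a weekday, product photo",
    "Weekday vans sneakers with a polaroid optical look",
]

for token, rate in token_pick_rates(opinion_tokens, prompts).items():
    print(f"{token}: {rate:.2f}")
```

With enough real samples, a token that is never picked (like “hornets”) would show a rate of 0, while the “favorites” would sit close to 1.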

So, maybe it’s not just my human “confirmation bias”, but the AI’s embeddings are indeed “aligned” in some weird ways, and “GPT knows something I don’t know” (at least when it comes to prompt engineering for a generative AI that has a CLIP inside)?

There are countless such examples. Just why are “optical polaroid” sneakers so fashionable with GPT-4? :slight_smile:
“Vans” makes perfect sense to a human, even a skateboard being generated at random does - but why is “optical polaroid” so much closer to “sneakers” than “hornets”, which the AI never picks for the prompt? Curious!

My most recent experiment involved this AImazing “Shap-E” that you (OpenAI) just silently dropped on github.

Comparing image-to-3D Shap-E vs. image-to-CLIP-to-language to GPT-4 → prompt for Shap-E text-to-3D, it appears that GPT-4 is very often able to extract the actual content of the image from CLIP’s mad token gibberish.
The goal constraints with regard to prompt design need adjustment from my side, though; GPT-4 had a tad too much freedom there (and made things too complex for Shap-E, with prompts that’d better suit text-to-image than text-to-3D).

[Shap-E results images removed due to limits; will try and post a follow-up with that image]

I find it bewildering (in an awe & awesome way) that a pre-trained CLIP (and I can’t even fit the largest ViT models into my humble 24 GB of VRAM for gradient ascent!) and GPT-4 seem to “click” in such coherent ways.

Thanks for the inspiration for this via your GPT-4 release party / tech demo – I am certain that for this, you are not using a smaller CLIP, but likely the biggest AND finetuned CLIP you have available - and you likely don’t just prompt GPT-4 with a CLIP “opinion” either. Probably you are using “other AI” akin to “Segment Anything” (SAM, Meta), too.

That’s why I am even more in awe about the results of my makeshift pseudo-multimodal “vision for GPT-4”.

Thanks again for the API access - for which I also have ChatGPT-4 to thank, because without ChatGPT-4’s coding skills, I would not have been able to make this project happen with GPT-3.5 and use my results with this project to apply for API access to GPT-4. :nerd_face: :pray:

So: Thanks for ChatGPT-4, API access to GPT-3.5, and API access to GPT-4 – in that order. It’s tremendous fun to play with this CLIP “running local like a mad dog, chasing trails of text in images”, and seeing what surprising things GPT-4 makes out of that. :blush:

So long, and thanks for all the AImazing awesome AI, in general!

PS: Here’s one more “funny outtake”; I was asking ChatGPT-4 to “compress the goal prompts” and do so in a way that itself understands them, ignoring human understanding (to avoid the context length exceeded API error in more complex tasks).
The AI’s system prompt seems to steer GPT-4 back into a direction of “verbosely elaborating around”, so I had to “remind it” to make things not-human-readable and to aim for maximum compression (reduction in total tokens) instead.

GPT-4, quote: “Remember, GPT-4 is a language model and isn’t designed to understand or execute programming concepts like loops in the same way a traditional programming language would.”

I laughed. I heard that sentence before, though in the context “not like a human would”.
Does that originate from RLHF? Did you try to mitigate the issue of people putting programming code into the LLM’s prompt and hoping for it to compile? Haha, I doubt it.
That’s likely emergent from the “not like a human would” RLHF conditioning. Funny! :slight_smile:

Here’s one of the Shap-E images, at least - I don’t want to push my luck and risk getting blocked by making an AI suspicious of this being spam ;-):