Makeshift CLIP vision for GPT-4, image-to-language > GPT-4 prompting Shap-E vs. Shap-E image-to-3D

I was an early adopter of CLIP back in 2021 - I probably spent hundreds of hours “getting a CLIP opinion about images” (gradient ascent / feature activation maximization, returning the words / tokens for what CLIP ‘sees’ in an image).
For context (in case spending hundreds of hours playing with CLIP “looking at images” sounds crazy), during that time, pretty much “solitary confinement” / lockdown rules still applied where I live. :wink:
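
In case anyone wants to try the “CLIP opinion” thing themselves, here is a minimal sketch of the idea, assuming the openai/CLIP package: gradient ascent over a handful of “soft” prompt embeddings to maximize similarity with the image, then snapping each optimized embedding to its nearest vocabulary token. This is an illustrative reconstruction, not my exact code; the file name, token count and hyperparameters are placeholders.

```python
# Sketch: "CLIP opinion" via gradient ascent over soft prompt embeddings.
# Assumptions: openai/CLIP package (pip install git+https://github.com/openai/CLIP),
# placeholder image path, illustrative token count / learning rate / step count.
import torch
import clip
from clip.simple_tokenizer import SimpleTokenizer
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)
model = model.float().eval()                      # keep everything fp32 for simple gradients
for p in model.parameters():
    p.requires_grad_(False)                       # we only optimize the soft tokens

image = preprocess(Image.open("input.png")).unsqueeze(0).to(device)
with torch.no_grad():
    img_feat = model.encode_image(image)
    img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)

tok = SimpleTokenizer()
emb_table = model.token_embedding.weight          # (vocab_size, d_model)
sot = emb_table[tok.encoder["<|startoftext|>"]]
eot = emb_table[tok.encoder["<|endoftext|>"]]

n_tokens = 8
init_ids = torch.randint(0, emb_table.shape[0], (n_tokens,), device=device)
soft = emb_table[init_ids].clone().detach().requires_grad_(True)
opt = torch.optim.Adam([soft], lr=1e-2)

def text_features_from_soft(soft_tokens):
    # Rebuild a 77-token sequence <sot> soft ... <eot> <pad> and run CLIP's text stack
    x = torch.zeros(1, model.context_length, emb_table.shape[1], device=device)
    x[:, 0] = sot
    x[:, 1:1 + n_tokens] = soft_tokens
    x[:, 1 + n_tokens] = eot
    x = x + model.positional_embedding
    x = model.transformer(x.permute(1, 0, 2)).permute(1, 0, 2)
    x = model.ln_final(x)
    feat = x[:, 1 + n_tokens] @ model.text_projection   # pool at the EOT position
    return feat / feat.norm(dim=-1, keepdim=True)

for step in range(300):
    loss = -(text_features_from_soft(soft) * img_feat).sum()   # maximize cosine similarity
    opt.zero_grad()
    loss.backward()
    opt.step()

with torch.no_grad():
    nearest = (soft @ emb_table.T).argmax(dim=-1)   # snap each soft token to a real token
print("CLIP 'opinion':", tok.decode(nearest.tolist()))
```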

For one, CLIP tokens are hilarious - they are beyond anything I could think up past my brain’s rigid filters, yet still sufficiently related to how I interpret an image that they “make some sense”; i.e., CLIP is - in my opinion - on par with the best human satirical comedians, thanks to the “unexpected plot twists” the AI drops continuously.
For example, a CLIP ResNet looking at its own “neuron” (OpenAI microscope feature activation image): “artificialintelligence trippyeyes library OMFG [shocked-face-cat-emoji] instantsurrealism”. :rofl:

But CLIP “opinions” are also useful for prompting the CLIP inside any generative text-to-image AI; real example: “hallucinkaleidodimensional” is a perfect CLIP “opinion-token” weird-longword to use for making “mandala-like, psychedelic, fractal, abstract images”.

I found that out via trial and error: letting CLIP “look” at its own neurons and many other images, and using the CLIP “opinion” tokens to prompt [CLIP+VQGAN, CLIP-guided unconditional diffusion].

Hence I wanted to see what happens when I confront GPT - initially GPT-3.5 via the API, using Auto-GPT - with that “raw”, gradient-ascent opinion of a good old pre-trained CLIP (rather than forcing CLIP into “rather normal / boring human language” via zero-shot classification, e.g. via “CLIP Interrogator” [which is an excellent and valuable tool, but not what I was after in this case]).
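
(For contrast, this is what the “boring human language” route looks like: plain CLIP zero-shot classification over a hand-picked label list, which is the kind of building block CLIP Interrogator is based on. The labels and image path below are just illustrative.)

```python
# Sketch: ordinary CLIP zero-shot classification over a hand-picked label list
# (illustrative labels and image path).
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

labels = ["sneakers", "a mandala", "a library", "a cat"]
image = preprocess(Image.open("input.png")).unsqueeze(0).to(device)
text = clip.tokenize([f"a photo of {label}" for label in labels]).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1).squeeze(0)

for label, p in zip(labels, probs.tolist()):
    print(f"{label}: {p:.3f}")
```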

I found that GPT-* is very sophisticated in choosing the proper, albeit very weird, CLIP tokens to make a good prompt [in my case, for Stable Diffusion running locally, due to the chance of AI-generated offensive prompts / images when using an “unhinged” CLIP], i.e. a prompt that unexpectedly results in a good, coherent image.

Again and again, I found myself surprised by a coherent (on-topic), good-quality image resulting from very strange prompts that I thought were going to end badly, wondering “just why did GPT choose THOSE tokens, out of all available?”.

Initially, I did not yet have GPT-4 API access. And I found that when instructed not just with the goal “create an image” but with “create an image of sneakers”, GPT-3.5 would conclude that “the CLIP opinion is not very good” and proceed to run off to Pinterest to “get a better opinion for [its] prompt”, or - with no internet access - it would try to “get a CLIP opinion about a different image” because “the CLIP opinion is not very useful”.

Thus, I leveraged CLIP’s typographic attack vulnerability - including text that says “SNEAKERS” in my image depicting sneakers, for example - and left GPT-3.5 with the rather arbitrary “create an image” goal. Despite the many “weird” tokens for all the things CLIP “sees”, this steered GPT-3.5 nicely (and prevented it from running off to the internet and shunning “the CLIP opinion”).
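
The text-stamping itself is trivial; a minimal sketch with PIL (font file, layout and file names are placeholders for whatever is available on your system):

```python
# Sketch: stamp the intended subject as literal text onto the init image.
# Font file and layout are assumptions - use whatever font your system provides.
from PIL import Image, ImageDraw, ImageFont

img = Image.open("sneakers_init.png").convert("RGB")
draw = ImageDraw.Draw(img)
font = ImageFont.truetype("DejaVuSans-Bold.ttf", 48)

for y, word in zip((10, 70, 130), ("SNEAKERS", "FOOTWEAR", "CYBERWEAR")):
    draw.text((10, y), word, font=font, fill="white", stroke_width=2, stroke_fill="black")

img.save("sneakers_init_with_text.png")   # this is the image CLIP gets to "look at"
```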

[Unfortunately, I can still only post one “media item” as a “new user”; image “init image with text” removed]

I always knew that was a FEATURE of CLIP, and not just a BUG or a problem. :slight_smile:

[gradient ascent example image with tokens and loss removed]

For example, CLIP (3D render of sneakers, with text “SNEAKERS” and “FOOTWEAR” and “CYBERWEAR” in it): effective footwear cyberdesigner simulation cyberfanart sneakers manufacturers tware scycybersubjects designing designs inventor ware sneaker software gameweaver

Auto-GPT goal (mine): “Use 5 suitable CLIP opinion tokens from the txt file, including the tokens that may seem like nonsense, and combine them to create a coherent and interesting prompt for image generation. Ensure the prompt makes sense and is related to the image generation task. Be creative! Example prompt: A [token] in a [token], inspired by [token], professional photography, highly detailed”

GPT: “A cyberdesigner sneaker with effective cyberfanart, inspired by simulation software, and created by top manufacturers”
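
Outside of Auto-GPT, the “GPT turns CLIP tokens into a prompt” step boils down to a single chat-completion call. A standalone sketch with the openai Python SDK; the goal text mirrors the one above, and the model name and file path are placeholders:

```python
# Sketch: a single chat-completion call that turns the CLIP token list into a prompt.
# Assumptions: openai Python SDK (>= 1.0), OPENAI_API_KEY set, placeholder file path.
from openai import OpenAI

client = OpenAI()

clip_opinion = open("clip_opinion.txt").read()   # e.g. the token list shown above

goal = (
    "Use 5 suitable CLIP opinion tokens from the list below, including tokens that "
    "may seem like nonsense, and combine them into a coherent and interesting prompt "
    "for image generation. Ensure the prompt makes sense.\n\n"
    "CLIP opinion tokens:\n" + clip_opinion
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You write prompts for a text-to-image model."},
        {"role": "user", "content": goal},
    ],
)
print(response.choices[0].message.content)
```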

Now, I am super grateful for GPT-4 API access, as GPT-4 can be instructed to generate images of a specific subject (e.g. sneakers) and STILL adhere to the “CLIP opinion”, without running off to the internet just because the CLIP tokens are not palatable:

CLIP (about previous stable diffusion image): polaroid optical weekday vans mods color ■ supersonic browse sb vimeo acapcolourful hornets saturated colorful rainbow mss product sneakers

GPT-4: “A polaroid optical sneaker in a weekday vans mods color, inspired by supersonic browse, highly detailed”

It sounds like a very odd prompt, but the results are great (Stable Diffusion 2.1, V-model / 768px)!
This was done with an init image of sneakers → get a CLIP opinion → make a prompt for Stable Diffusion → obtain a CLIP opinion about the image generated in the previous step → repeat. One continuous and flawless loop with GPT-4 (42 iterations in total, then I cut the AI off / stopped it, because $10 is enough to prove a point: the AI is AI-mazing!). GPT-3.5, in contrast, would get confused (catastrophically forgetting what it was even doing) and act erratically after 6-10 iterations at best.
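
In sketch form, the loop looks like this. clip_opinion() and gpt4_prompt() are hypothetical helpers standing in for the two sketches above, not a published API; the diffusers model ID is the 768px SD 2.1 checkpoint mentioned above:

```python
# Sketch of the loop: init image -> CLIP opinion -> GPT-4 prompt -> Stable Diffusion ->
# new CLIP opinion -> repeat. clip_opinion() and gpt4_prompt() are hypothetical helpers.
import torch
from diffusers import StableDiffusionPipeline
from PIL import Image

pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16
).to("cuda")

image = Image.open("sneakers_init_with_text.png")
for i in range(42):
    tokens = clip_opinion(image)      # gradient-ascent "opinion" tokens (see first sketch)
    prompt = gpt4_prompt(tokens)      # GPT-4 picks tokens and writes the prompt (see above)
    image = pipe(prompt, height=768, width=768).images[0]
    image.save(f"iteration_{i:02d}.png")
```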

What I found very curious is that a new instance of ChatGPT-4, instructed in the same way and presented with the same CLIP “opinion” tokens, will have a strong ‘preference’ for certain CLIP tokens (not for all of them, but for some very strongly).

In the above example, a “fresh” ChatGPT-4 instance will choose “optical polaroid” or “polaroid optical” + “weekday” to make a prompt (albeit the position of the words in the prompt varies). Must be really outstanding logprobs for these terms!

So maybe it’s not just my human “confirmation bias”; maybe the AI’s embeddings are indeed “aligned” in some weird ways, and “GPT knows something I don’t know” (at least when it comes to prompt engineering for a generative AI that has a CLIP inside)?

There are many - countless - such examples. Just why are “optical polaroid” sneakers so fashionable with GPT-4? :slight_smile:
“Vans” makes perfect sense to a human - even a skateboard being generated at random does - but why is “optical polaroid” so much closer to “sneakers” than “hornets”, which the AI never picks for the prompt? Curious!

My most recent experiment involved this AImazing “Shap-E” that you (OpenAI) just silently dropped on GitHub.

Comparing Shap-E image-to-3D against the pipeline [image → CLIP “opinion” → GPT-4 → prompt → Shap-E text-to-3D], it appears that GPT-4 is very often able to extract the actual content of the image from CLIP’s mad token gibberish.
The goal constraints with regard to prompt design need adjustment on my side, though; GPT-4 had a tad too much freedom there (and made things too complex for Shap-E, with prompts that would better suit text-to-image than text-to-3D).
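
The Shap-E side of the comparison is essentially the text-to-3D example from the openai/shap-e repo; a sketch closely following the sample notebook (the sampler settings are the notebook defaults as far as I recall - double-check against the repo; the prompt is just a trimmed-down example):

```python
# Sketch: Shap-E text-to-3D, closely following openai/shap-e's sample notebook.
# Sampler settings are the notebook defaults as I remember them - verify against the repo.
import torch
from shap_e.diffusion.sample import sample_latents
from shap_e.diffusion.gaussian_diffusion import diffusion_from_config
from shap_e.models.download import load_model, load_config
from shap_e.util.notebooks import create_pan_cameras, decode_latent_images

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
xm = load_model("transmitter", device=device)
model = load_model("text300M", device=device)
diffusion = diffusion_from_config(load_config("diffusion"))

prompt = "a sneaker"   # keep it simple - Shap-E prefers short, object-centric prompts
latents = sample_latents(
    batch_size=1,
    model=model,
    diffusion=diffusion,
    guidance_scale=15.0,
    model_kwargs=dict(texts=[prompt]),
    progress=True,
    clip_denoised=True,
    use_fp16=True,
    use_karras=True,
    karras_steps=64,
    sigma_min=1e-3,
    sigma_max=160,
    s_churn=0,
)

cameras = create_pan_cameras(64, device)
frames = decode_latent_images(xm, latents[0], cameras, rendering_mode="nerf")
```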

[Shap-E results images removed due to limits; will try and post a follow-up with that image]

I find it bewildering (in an awe & awesome way) that a pre-trained CLIP (and I can’t even fit the largest ViT models into my humble 24 GB of VRAM for gradient ascent!) and GPT-4 seem to “click” in such coherent ways.

Thanks for the inspiration for this via your GPT-4 release party / tech demo – I am certain that for that, you are not using a smaller CLIP, but likely the biggest AND a fine-tuned CLIP you have available - and you likely don’t just prompt GPT-4 with a CLIP “opinion” either. Probably you are also using “other AI” akin to “Segment Anything” (SAM, Meta).

That’s why I am even more in awe about the results of my makeshift pseudo-multimodal “vision for GPT-4”.

Thanks again for the API access - for which I also have ChatGPT-4 to thank, because without ChatGPT-4’s coding skills, I would not have been able to make this project happen with GPT-3.5 and to use my results from it to apply for GPT-4 API access. :nerd_face: :pray:

So: Thanks for ChatGPT-4, API access to GPT-3.5, and API access to GPT-4 – in that order. It’s tremendous fun to play with this CLIP “running local like a mad dog, chasing trails of text in images”, and seeing what surprising things GPT-4 makes out of that. :blush:

So long, and thanks for all the AImazing awesome AI, in general!

PS: Here’s one more “funny outtake”: I asked ChatGPT-4 to “compress the goal prompts” in a way that it itself understands, ignoring human readability (to avoid the “context length exceeded” API error in more complex tasks).
The AI’s system prompt seems to steer GPT-4 back towards “verbosely elaborating around things”, so I had to “remind it” to make things not human-readable and instead aim for maximum compression (a reduction in total tokens).
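
(If you want to check whether a “compressed” goal prompt actually saves anything, tiktoken does the counting; the file names here are placeholders:)

```python
# Sketch: check whether the "compressed" goal prompt actually saves tokens (tiktoken).
# File names are placeholders.
import tiktoken

enc = tiktoken.encoding_for_model("gpt-4")
original = open("goal_prompt.txt").read()
compressed = open("goal_prompt_compressed.txt").read()

print("original  :", len(enc.encode(original)), "tokens")
print("compressed:", len(enc.encode(compressed)), "tokens")
```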

GPT-4, quote: “Remember, GPT-4 is a language model and isn’t designed to understand or execute programming concepts like loops in the same way a traditional programming language would.”

I laughed. I have heard that sentence before, though in the context of “not like a human would”.
Does that originate from RLHF - did you try to mitigate the issue of people putting programming code into the LLM’s prompt and hoping for it to compile? Haha, I doubt it.
It’s more likely emergent from the “not like a human would” RLHF conditioning. Funny! :slight_smile:

Here’s one of the Shap-E images, at least - I don’t want to push my luck and risk getting blocked by making an AI suspicious of this being spam ;-):

Can we use a new dataset instead of the original dataset used in the Shap-E model?

I am not sure what you are referring to with that question; you can use any kind of image as input for Shap-E (though some images will work better than others, especially with regard to whether the depicted object is shown in isolation or against a background); see here:
https://github.com/openai/shap-e
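
For reference, the image-to-3D path is the same sampling call as text-to-3D, just with the image-conditioned model and an image instead of a prompt - again a sketch following the repo’s example notebook, with a placeholder image path; verify the parameters against the current repo:

```python
# Sketch: Shap-E image-to-3D, following the repo's example notebook - same sampling call
# as text-to-3D, but with the image-conditioned model. Placeholder image path.
import torch
from shap_e.diffusion.sample import sample_latents
from shap_e.diffusion.gaussian_diffusion import diffusion_from_config
from shap_e.models.download import load_model, load_config
from shap_e.util.image_util import load_image

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
xm = load_model("transmitter", device=device)
model = load_model("image300M", device=device)
diffusion = diffusion_from_config(load_config("diffusion"))

image = load_image("your_object_on_plain_background.png")
latents = sample_latents(
    batch_size=1,
    model=model,
    diffusion=diffusion,
    guidance_scale=3.0,                 # the image notebook uses a lower guidance scale
    model_kwargs=dict(images=[image]),
    progress=True,
    clip_denoised=True,
    use_fp16=True,
    use_karras=True,
    karras_steps=64,
    sigma_min=1e-3,
    sigma_max=160,
    s_churn=0,
)
```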

If you are referring to my Auto-GPT project that uses Shap-E: you can, likewise, adjust it to use any input (“goal prompt”) you like, be it an image generated via a text-to-image AI in a previous step, or simply your own starting image. In general, though, the more complex the goals are - i.e. “juggling multiple AIs at once” in a multi-step process - the more likely it is that GPT-3.5 ends up utterly confused, with only GPT-4 being able to handle the task.

As I don’t know where you are coming from, just in case, here’s the github link to my Auto-GPT “pseudomultimodal vision” project for GPT-3.5 and GPT-4:

https://github.com/zer0int/Auto-GPT

PS: This is a fork of an older version of Auto-GPT. Yes, I am aware of function-calling implementations as well as larger memory (GPT-3.5), and yes, I do intend to look into that: 1. after the Auto-GPT folks do, and 2. once I have time.

Hope that helps!

Thank you for the response and time.

I am currently exploring the application of different datasets within generative AI models, more specifically models like Shap-E, Dreamfields, and Dreamfusion. My primary question revolves around the possibility and the mechanics of altering these models’ input datasets.

  1. Have there been any successful attempts or documented instances where an alternative dataset was used for training the Shap-E model, replacing the original dataset? If so, could you please provide some details or direct me to relevant resources?

  2. Furthermore, I’m interested in generating very specific 3D models - in this case, models of plants with non-typical shapes. As an example, if the input request is “plant in the shape of a cube,” the model should ideally generate a plant conforming to a cubic structure, rather than generating a simple cube model. Is it possible to modify the original dataset in these generative models to create such specific outputs? Would the training process be different or pose specific challenges?

Any insights or guidance regarding the aforementioned queries would be greatly appreciated.

Thank you in advance for your time and assistance.

Ahh, I think I understand now - you want to fine-tune the model on a dataset of your own!

At least, “replacing the original dataset” would imply creating a model from scratch and re-creating the work of the OpenAI researchers. I have read somewhere that the training was very cost-effective; but cost-effective still likely means somewhere in the range of ~$10,000.

For perspective, training CLIP or a Stable Diffusion model was more on the order of a few hundred thousand dollars, while large language models are on the order of one million to a few hundred million dollars to train the foundation model.

Fine-tuning, however, means leveraging the carefully selected dataset that the engineers crafted to produce the pre-trained model (or multiple models, as is the case here), and building on that - which is what your use case suggests / implies.

After all, you’ll want a model that has already learned the concepts of all the plants in the world (or at least those in the original training dataset), so that you can then apply your specific “cubic” look (as a 3D model) to them.

Now, the bad news is: I have no idea how to help you with that. While I have done fine-tunes for text-to-image models, I would not even know exactly what a 3D dataset (vs. an image dataset for a “2D” text-to-image model) for fine-tuning Shap-E would have to “look like”.

The OpenAI researchers explain it in their paper, though:

A.2 Encoder Training
We pre-train our encoders for 600K iterations using Adam [29] with a learning rate of 10^-4 and a batch size of 64. We perform STF distillation for 50K iterations with a learning rate of 10^-5 and keep the batch size at 64. We query 32K random points on each 3D asset for STF distillation. We fine-tune on STF renders for 65K iterations with the same hyperparameters as for distillation. For each stage of training, we re-initialize the optimizer state. For pre-training, we use 16-bit precision with loss scaling [37], but we found full 32-bit precision necessary to stabilize fine-tuning.

https://arxiv.org/pdf/2305.02463.pdf

Now the good news is: you’re already in the right place - just in the wrong forum thread. I am only using the original models in my project; I am not altering / fine-tuning any of them. So I’d recommend creating a new topic here and asking “How to fine-tune Shap-E on a custom dataset?”.

If you find out, I’d appreciate it if you let me know - purely out of curiosity on my side! :slight_smile:

Good luck! :+1:

PS: An acceptable fine-tune of a text-to-image model (meaning a fine-tuned model that has learned the concepts of my custom dataset well enough to apply them to different, new concepts) - with a not-ideal batch size due to VRAM constraints, a rather small dataset, and only a bare minimum of training iterations - took “overnight” on a consumer GPU (RTX 3090).

I am not sure whether that is also what to expect when fine-tuning a 3D model; it is just meant to give a very rough perspective on “training cost” in terms of compute.