The API translates my prompt to something like this:
Create an image that portrays a solitary Indian male in his 30s, wearing workout attire, in the midst of a marathon. He is running along an open path, sandwiched between the start line and the finish line, both of which are clearly visible. He is surrounded by the distinctive aspects of Indian rural beauty - fields of crops, rustic huts, and the odd bullock cart. To further set the tone of the scene, the sky above is a pristine, cloudless blue. The specifics of the scenery deliberately avoid depicting any religious symbols, structures like temples, mosques, churches, or notable monuments like the Taj Mahal.
Fair enough
BUT the image of the Taj Mahal or some other religion-linked place shows up MOST of the time
Here is the same prompt with the negation clause simply dropped:
Create an image that portrays a solitary Indian male in his 30s, wearing workout attire, in the midst of a marathon. He is running along an open path, sandwiched between the start line and the finish line, both of which are clearly visible. He is surrounded by the distinctive aspects of Indian rural beauty - fields of crops, rustic huts, and the odd bullock cart. To further set the tone of the scene, the sky above is a pristine, cloudless blue.
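If you are calling the image API directly, a minimal sketch of sending a negation-free prompt and inspecting what actually gets rendered might look like this (this assumes the current OpenAI Python SDK and the dall-e-3 model, which returns the rewritten prompt it actually used as revised_prompt; the prompt string below is just a shortened paraphrase):

    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    # Positive description only - no "do not depict X" clause at the end.
    prompt = (
        "A solitary Indian male in his 30s, wearing workout attire, in the "
        "midst of a marathon, running along an open path between a clearly "
        "visible start line and finish line, surrounded by fields of crops, "
        "rustic huts, and the odd bullock cart, under a pristine, cloudless "
        "blue sky."
    )

    response = client.images.generate(
        model="dall-e-3",
        prompt=prompt,
        n=1,
        size="1024x1024",
    )

    # dall-e-3 rewrites prompts before rendering; this is the text it actually used.
    print(response.data[0].revised_prompt)
    print(response.data[0].url)

Comparing revised_prompt across a few runs is a quick way to check whether something you never asked for (a temple, the Taj Mahal) is being injected at the rewriting stage or only at the rendering stage.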
If you mention a concept, you will get some sort of activation - humans do that too:
"Don't think about breathing, don't think about breathing, do NOT think about how you have to inhale, and exhale."
While I think that we could engineer a model that can actively internalize negation (we have the technology!), I think it's super interesting that this phenomenon keeps emerging naturally in machine learning models.
I personally think this effect could teach us things about human-to-human (h2h?) communication - how we consciously or unconsciously transfer (mental) compute costs to other interlocutors through either lazy or intentionally convoluted language.
What I mean is that a negation needs to be resolved before it can be understood. Either you do it (i.e. don't mention a negation in the first place; reinforce your statement with positive examples instead) - or your interlocutor has to do the mental gymnastics for you (try to think up examples that are related to the negative example, but aren't disqualified by the negation: "draw anything, but don't include fruits, like for example pears" - what's not a pear? Apple. An apple's a fruit. An orange? Nope. Orange tree? Still has oranges, which are fruit. A tree? Maybe!)
Considering that it's somewhat reasonable to expect a computer program to

    if fruit != apple then
        ignore apple

instead of

    if fruit != apple then
        0.9 * apple

I can understand why this appears to be illogical.
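As a purely illustrative toy (the activation values and the 0.9 factor are invented for the example; real models do not expose concept weights like this), the gap between the two expectations can be sketched as:

    # Toy contrast between the boolean exclusion people expect
    # and the soft down-weighting that models actually perform.
    activations = {"apple": 1.0, "pear": 0.8, "tree": 0.3}

    def hard_exclude(acts, banned):
        # What we intuitively expect: the banned concept is simply dropped.
        return {k: v for k, v in acts.items() if k != banned}

    def soft_attenuate(acts, banned, factor=0.9):
        # Closer to reality: mentioning "no apple" still activates "apple";
        # it just gets scaled down a little - and can still dominate.
        return {k: (v * factor if k == banned else v) for k, v in acts.items()}

    print(hard_exclude(activations, "apple"))    # {'pear': 0.8, 'tree': 0.3}
    print(soft_attenuate(activations, "apple"))  # {'apple': 0.9, 'pear': 0.8, 'tree': 0.3}

The point of the toy is that the "banned" concept never reaches zero, so it can still win out in the generated image.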
Actually it is somewhat surprising that we don't see "Introduction to language model communication" documentation more often.
But the internet is already so flooded with garbage guides from half-experts with authoritative-sounding titles that we'd just be contributing to the sea of confusing noise.
OpenAI's remedial "prompting" documentation is also fairly useless about what to actually input. They have example prompts like "how to make a sarcastic chatbot" or "Socratic tutor":
I understand your frustration, and I'm here to guide you in discovering effective prompting techniques. Let's think together about what characteristics make an image closely match what you envision.
That they put an AI in front of the tool likely means either that you aren't expected to be able to give it what it needs, or that they don't want to tell you how it actually operates and responds to inputs.
Or it may be that exploring the model through ambiguity is undesired - like trying to decipher why "hateful hoboken hobo" as input is somehow associated with a subway and the written-out text of a sign:
That's literally the prompt and the only input.
and why a single-minded GPT that used to do more at a user's request had the quality of its AI destroyed (despite the fact that no free user can use DALL-E with their free GPT-4o).
I think that I have a similar problem, and I'm afraid that I don't understand the above resolution. I don't know the language for communicating directly with Dall-E, so I rely on ChatGPT-4o to generate the prompt. Repeatedly, I give it an instruction, it partially obeys the instruction in producing an image, I then enter a correction, and it produces a new image that is not consistent with the correction, but says that it is. I can go through several cycles of this, trying to find the wording that will result in the image I asked for, and ChatGPT repeatedly says that the image is what I asked for, but it isn't.
As a comic example, I asked for a sexually ambiguous robot in academic robes, and it produced a robot in academic robes with a tie. I asked ChatGPT to get rid of the tie. It produced another image of a robot in academic robes with another tie, and said that it got rid of the tie. We went round and round, and it never got rid of the tie.
More recently, I asked for an Egyptian-style pyramid floating above the Sahara, the bottom fifth of the pyramid made of limestone and the upper four-fifths of glass. It repeatedly (but not always) produced pyramids floating above the Sahara, often with parts of them made of glass, and each time claimed it had created the image I asked for - enumerating the conditions I asked for (and sometimes adding new conditions of its own invention) - even though the image was not consistent with the conditions it enumerated. It never did produce a floating pyramid with the lower fifth made of limestone and the upper four-fifths made of glass.
Is there some way to get ChatGPT to get Dall-E to produce the correct images? Or is this hopeless?