I’ve generally had fantastic results telling DALL-E exactly what I want a human character to look like in terms of appearance, clothing, etc. If I try to describe 2 human characters, it’ll do a pretty solid job with maybe a slight hiccup with clothing (e.g. a color palette switch). 3 human characters seems to be where the prompt gets overloaded. The characters are often conflated, with traits assigned to the wrong character and sometimes 2 characters looking nearly identical.
Moreover, I’ve explored changing my characters’ genders, ethnicities, etc. to see how DALL-E responds, and I find it often ignores specific instructions. For example, a prompt explaining that C1 and C2 are men and C3 is a woman might turn C2 into a woman. If I describe all 3 characters as white, there seems to be a high probability of C3’s description being ignored and C3 being turned into a Black character. It’s kind of jarring because the Black male or Black female character generated looks nearly identical in each prompt run - same exact face, hairstyle, clothing, etc. every time, as if they were just a default placeholder. When I describe all 3 characters as Black, I find C3 is usually changed into a different ethnicity as well, although I see more variation, as it’s sometimes a white person, an Asian person, etc.
Anyway, I’m well aware that group images default to people of diverse backgrounds, but it’s interesting that even if I try the “follow this prompt exactly” thing, it tends to get ignored if I describe 3 characters. And that’s just generating 3 characters - if I then wanted to explain an action, a setting, a style, etc., I’d think it’s too much.
Guess I figured I’d check to see if anyone has had luck and/or pointers for wording a prompt to describe multiple characters effectively. It could be that I’m simply hitting the limit of what is possible now, which if so is fine, figuring out the limits is fun to me. But figured it was worth asking.
Let’s throw three distinct people into a prompt:
full-body publicity photo of three members of a girl rock band. Girl 1 has chin-length striking red hair and fair complexion, and wears round glasses, and is short. Girl 2 has large afro hair and dark skin, and is skinny. Girl 3 is Persian and Mediterranean with curly long hair.
The AI decided that two results were filtered and I only get two cartoons back…but they are as described (girl 1 has four hands though)
It took a lot more prompting (photo studio, real women, professional photograph) to get one image back from Bing…
It’s still not great at scene composition. The left girl is not rendered as “short.”
You might have to drop your individual characters into Photoshop for a final composition. The more words you input, the more confusion is possible. A multi-line bullet-point list, brackets, or other semantics could group them, but then it is still like you say. I get my three animatronic artists all wearing glasses.
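For anyone who wants to try the bullet-point grouping idea, here is a minimal sketch that assembles the band prompt from this thread into one line per character. The `Character` class and `build_group_prompt` helper are hypothetical names made up for illustration, and nothing here guarantees the image model will actually respect the grouping.

```python
from dataclasses import dataclass


@dataclass
class Character:
    """One person in the image: a label plus the traits that belong to them."""
    label: str
    traits: list[str]


def build_group_prompt(scene: str, characters: list[Character]) -> str:
    """Put the scene on the first line, then one bullet line per character."""
    lines = [scene.strip()]
    for character in characters:
        lines.append(f"- {character.label}: {', '.join(character.traits)}")
    return "\n".join(lines)


# Traits taken from the example prompt earlier in the thread.
band = [
    Character("Girl 1", ["chin-length striking red hair", "fair complexion",
                         "round glasses", "short"]),
    Character("Girl 2", ["large afro hair", "dark skin", "skinny"]),
    Character("Girl 3", ["Persian and Mediterranean", "curly long hair"]),
]

prompt = build_group_prompt(
    "full-body publicity photo of three members of a girl rock band.", band
)
print(prompt)
```

Keeping each character's traits on a separate line at least makes it easy to trim or reorder details when traits start bleeding between characters.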
Perhaps using “full-body photograph” to avoid portraits also cues this lifeless style.
That’s better than most of my 3-person images come out, but that might be because I’m also trying to depict an action, which leaves room for muddying the waters.
If I were able to just generate the characters in one prompt and, then, reference that exact image seed for it to use in generating an action in the next image generation, I wonder how good it would be at taking those same 3 characters and then depicting them in the action. Alas, I can’t really get consistency in subsequent prompts without constantly re-explaining how the characters should look. Would be a game changer for, say, adding illustrations to each chapter of a book you’re writing.
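Until something like seed reuse is available, one workaround for the constant re-explaining is to write the character descriptions once and paste them verbatim into every scene prompt. This is only a rough sketch: the descriptions, chapter actions, and the `scene_prompt` helper are all hypothetical, and identical wording keeps the text consistent but does not guarantee the generated faces will match between images.

```python
# Hypothetical character sheet: written once, reused word for word in every prompt.
CHARACTERS = {
    "C1": "a tall man with short gray hair, wearing a green coat",
    "C2": "a man with a black beard, wearing a red scarf",
    "C3": "a woman with long braided hair, wearing a blue jacket",
}


def scene_prompt(action: str, cast: list[str], style: str = "book illustration") -> str:
    """Combine the fixed descriptions of the cast with a per-scene action."""
    described = "; ".join(f"{name} is {CHARACTERS[name]}" for name in cast)
    return f"{style}. Characters: {described}. Scene: {action}"


# Hypothetical per-chapter actions for an illustrated book.
chapters = {
    1: "the three of them meet at a train station",
    2: "the three of them argue over a map in a kitchen",
}

prompts = {
    number: scene_prompt(action, ["C1", "C2", "C3"])
    for number, action in chapters.items()
}

for number, text in prompts.items():
    print(f"Chapter {number}: {text}")
```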
I think this is so cool as is, yet there’s lots of possibly untapped territory.
I had a brainwave, but the results are worth the two cents paid - use the edits endpoint, and have the AI “extend image to include two more girls”.
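For reference, a call like that can be sketched roughly as below. It assumes the `openai` Python package (v1.x), an `OPENAI_API_KEY` in the environment, the `dall-e-2` model, and a square PNG that has already been padded with transparent pixels where the new figures should go; with no mask supplied, the transparent area is what gets filled in. The file name and helper names are hypothetical.

```python
def build_edit_request(prompt: str, size: str = "1024x1024") -> dict:
    """Collect the non-file parameters for an image edit call (assumed settings)."""
    return {"model": "dall-e-2", "prompt": prompt, "n": 1, "size": size}


def extend_image(image_path: str, prompt: str) -> str:
    """Send the padded PNG to the edits endpoint and return the result URL."""
    from openai import OpenAI  # imported here so the helper above works without it

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    with open(image_path, "rb") as image_file:
        result = client.images.edit(image=image_file, **build_edit_request(prompt))
    return result.data[0].url


request = build_edit_request("extend image to include two more girls")
print(request)

# Example call (needs network access and a prepared file, so it is left commented out):
# print(extend_image("band_padded.png", "extend image to include two more girls"))
```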
It is sufficient to say that the API’s image engine is poor…
Yeah in general it’s diminishing returns once you get to more than about 2 characters. 3 characters leaves you with a conundrum where too much detail on each will get character traits muddled/combined, too little detail will give you something pretty generic and lifeless.
Out of curiosity I generated 3 characters who at least vaguely resembled my intent, and I then said “OK, now take the characters in image #3 and add a 4th character.” The most common result was one of the original 3 being discarded in favor of the new character (and the original characters that remained looked slightly different from the base image, although to be fair they were close).
I did get it to generate a decent-looking 4-person band, but only when I didn’t describe any individual member and just asked for a 4-person rock band. In terms of specific depictions, I think 3 is probably the most it can “focus” on at one time, but even that is a stretch.