DALL-E is illiterate with the text it adds in images

Given that DALL-E is vastly more illiterate than a newborn, how do I get it to follow extremely clear instructions to just NOT put any words in the images it creates? As you can see from this screenshot, it is utterly unable to follow instructions.

Welcome to the forum. Because of the way LLMs work, they’re not great at negative prompting, i.e., telling them NOT to do something. They’re a lot better than they were in the past, but it still causes problems.

Additionally, it’s not going to be able to change the original version exactly. The DALL-E 3 edit endpoint isn’t available yet, but once it is, this might be possible with it.

For cleaning up the original image, I’d recommend a tool like Photoshop or something that will allow you to select the text areas and remove it cleanly while matching the background image.

Hopefully this can be automated more in the future, but for now, it’s beyond the tool’s capabilities.

Hope this helps.

@ariaconsulting

Could you please provide a more descriptive title? The current one doesn’t effectively convey the issue you need assistance with, which might make it challenging for others to locate this topic.

Thank you! :blush:

Thanks Paul. Thing is, if I had any abilities in PS (or really any other image editing software including pretty much everything from Adobe for that matter), I probably wouldn’t ask the AI to create anything as I could do it myself.

For software that is part of a large LANGUAGE model (from an org worth tens of billions of dollars), one would think it could spell the words IT adds to images. I could be more lenient in my view if I were asking the software to add words. But I’m not. I am, in fact, trying very hard to get it to just follow a highly explicit command. So, it’s not only illiterate, it’s also unwilling to cooperate.

How exactly is this worth 10s of billions??

Hey Eric, I changed the title as requested.

Because even with its limitations, this stuff wasn’t possible a few years ago. It’s getting better.

The LLM and image generation stuff are slightly different, so just because it can deal with language in text, it doesn’t necessarily transfer over (easily) to images. I imagine things will continue improving at an accelerated rate.

So, there’s nothing that can be done to prevent these outcomes until/unless OpenAI decides to fix it. IF they do. Got it. SUPER worthwhile to pay for…:angry:

Thanks Paul.

That’s what I’m trying to say… it will get better. It’s not that it’s “broke” … rather the technology just hasn’t improved enough yet.

Personally, I find a lot of value in the $20/month ChatGPT Plus subscription. It doesn’t do everything for you (yet), but it’s a force multiplier for sure.

@ariaconsulting you need to understand a little about how the models work to know why what you are asking for is beyond the model’s capabilities.

DALL-E 3 is a diffusion model. It basically takes a field of random noise and finds the image inside of it; it doesn’t just plop text on top of a generated image. It has gotten better at “finding” the text in the noise, but it’s just not there yet.

At the end of the day, it will probably take a few more advances, a new tool/workflow, or some substantial post-processing to get production-ready text into AI-generated images.

I could accept that as the situation much more easily if it would at least follow explicit instructions. If I type an “A” in Word, I expect to see an “A”. Not a “J”. If I click on “generate invoice” in my A/R system I expect to get an invoice. Not an inventory transfer.

When I tell DALL-E to do something “exactly” it ignores it completely. If I tell it explicitly to not do something, it ignores it. So, basically, the only use for this technology is if you are either a) happy to accept whatever it creates first time around, or b) have the skills (and subscriptions!) to Adobe to fix everything yourself because the software IS broken. After 25 years working in business technology (ECM, CRM, ERP, digital workplace) as a programme/project manager, solution and enterprise architect, and management consultant, I feel I have some qualifications to tell broken software from acceptable software. And software that ignores user commands is BROKEN. Sorry, just how I see it.

If I were to tell a customer their CRM software was acceptable when users were deleting records when they click “save”, I would 100% expect that customer to fire my ass.

Sure. It’s broken in the sense that it doesn’t do exactly what you want, which is to not only create an image perfectly to your description, but also view it and adjust it under the updated specifications.

As mentioned, you should learn about the tools before you start using them and expect the world from them. That really should be step one of introducing a new tool as a project manager.

I swear. This is like being a part of the discovery of boats and being like “WTF, I have to paddle? Can’t it just move itself?” In due time, sure.

This is new and part of a revolutionary technology. If you seriously can’t edit this to remove or adjust the font, then either learn how, as it’s very simple, or pay someone 70% less than it would cost to commission the complete artwork in the first place.

Please take this constructively. :slightly_smiling_face:

I call this the “Whoooooa, Betsy” syndrome.

In the days before automobiles, people used to ride horses, often one named Betsy. To make Betsy stop, they would pull back on the reins and say, “Whoooooa, Betsy.” Then, with the advent of cars, when they wanted to stop, they’d pull back on the steering wheel and exclaim, “Whoooooa, Betsy.” Needless to say, it didn’t work.

That’s not how diffusion models work—they aren’t instruction following machines.

Very broadly speaking, they take a bunch of noise and iteratively de-noise it until they arrive at a place where the current state of the pixels is an arrangement which is statistically likely to have the input prompt as a label.
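To make that loop a bit more concrete, here is a toy sketch in Python. It is not how DALL-E 3 is implemented, and `toy_denoise` is a made-up name. The big assumption is the `target` list, which stands in for "pixels that match the prompt": a real diffusion model has no such target and instead uses a trained network, conditioned on the prompt, to predict the noise at each step.

```python
import random


def toy_denoise(target, steps=50, seed=0):
    """Refine random noise toward `target` over `steps` iterations.

    `target` is a stand-in for "pixels that match the prompt" (an assumption
    for this toy; a real model predicts noise with a trained network).
    """
    rng = random.Random(seed)
    # Start from pure Gaussian noise, one value per "pixel".
    pixels = [rng.gauss(0.0, 1.0) for _ in target]
    for step in range(steps):
        # Stand-in for the model's noise prediction at this step.
        predicted_noise = [p - t for p, t in zip(pixels, target)]
        # Remove a growing fraction of the remaining noise each step.
        fraction = 1.0 / (steps - step)
        pixels = [p - fraction * n for p, n in zip(pixels, predicted_noise)]
    return pixels


if __name__ == "__main__":
    print(toy_denoise([0.2, 0.8, 0.5]))
```

Note that nothing in the loop parses a "do" or "don’t". In this picture the prompt only influences what the noise prediction looks like, which is roughly why a phrase like "no text" doesn’t behave like a command.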

Your displeasure with the model seems to stem from fundamentally misunderstanding how they work which has led to forming unrealistic expectations of their capabilities.

With respect to text, this is an area in which all image-generating AIs have struggled. It’s something that is actively being worked on, but disregarding an entire image-generating model because it cannot perfectly render text is a bit like judging a fish for being bad at climbing trees.

I understand your frustration and I know no amount of explanation as to why the model is unable to meet your expectations will make you feel any better about it, but at this point I’m unclear what you’re hoping for from this topic.

It has been explained that there isn’t a model capable of doing what you want. Are you just looking to vent?

That’s not how diffusion models work—they aren’t instruction following machines

I definitely agree with THAT!

To be clear, I stated that it would be much easier to accept limitations and growing pains as far as the outputs if it could be made to follow the most basic and explicit instructions in a prompt. It would also seem way more helpful if, when you ask the software if it fully and completely understands the instructions in the prompt, it came back with something other than an obviously erroneous response.

You’ve stated that it can’t follow any instructions due to how the model is designed. I’m not an AI/ML developer so I accept that the model itself is the limitation and it will only ever be an expensive random image generator (which have existed for YEARS, by the way). Fair enough, I don’t expect a stove to make ice cubes either :grin:.

Since nothing can be done, I suppose that makes anything to do with this topic “just venting”. But graphic artists are clearly VERY safe in their jobs for quite some time yet. At least those that can follow “do/don’t” instructions. And spell! :sweat_smile:

As far as your statement:

Your displeasure with the model seems to stem from fundamentally misunderstanding how they work which has led to forming unrealistic expectations of their capabilities.

My expectations are based upon what OpenAI themselves claim about the tool. I would encourage you to LISTEN to what they say right on the DALL-E home page: DALL·E 2 (openai.com)

Anyway, thank you, I appreciate the assistance.

Yeah. It’s basically a one-shot opportunity. You can kind of work off the same prompt but it still can be different.

I miss the old Dall-E. It had the ability to erase specific areas and ask the model to recreate it, which would help you try to accomplish what you want. It had a lot of other cool tools on top of that.
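For anyone who never used that erase tool: the general idea (usually called inpainting) is that you give the model the image plus a mask marking the region to regenerate. As far as I know, the DALL-E 2 edit endpoint treats the transparent areas of the mask as the part to edit. The sketch below only builds such a mask as a plain grid of alpha values. `make_erase_mask` and the box coordinates are hypothetical, and it doesn’t call any API or write an image file.

```python
def make_erase_mask(width, height, box):
    """Return a height x width grid of alpha values.

    255 = keep the original pixel, 0 = transparent (area to regenerate).
    `box` is (left, top, right, bottom), with right and bottom exclusive.
    """
    left, top, right, bottom = box
    return [
        [0 if left <= x < right and top <= y < bottom else 255 for x in range(width)]
        for y in range(height)
    ]


if __name__ == "__main__":
    # Hypothetical example: erase a strip where unwanted text might sit.
    for row in make_erase_mask(8, 6, (2, 3, 6, 5)):
        print(row)
```

To actually use it you would still need to save the grid as the alpha channel of a PNG the same size as the original image.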

I disagree about it being expensive though. The images it produces are very powerful. You have hit both of its weaknesses (revisions & coherent text). For everything else, it’s incredible.

I have a freaking talent for finding weak spots like that, @RonaldGRuckus :wink:.

Seriously, Microsoft should send me monthly cheques in gratitude for all the broken stuff they release that I tend to be one of the first/only people to come across!

When MS first released the Marketing module in their Dynamics 365 CRM cloud, I was one of the first few people in North America to actually try it. ELEVEN support tickets stretching over FIVE MONTHS before we could get it to actually send an email…:roll_eyes:

Well, to be fair, text generation has been a known issue in generative image models for years and the ability to edit an image in DALL-E 3 is a feature that existed in DALL-E 2 which just hasn’t been released yet for DALL-E 3, so I would be reluctant to credit you as a discoverer of these shortcomings.

Hey, I had a similar issue when designing a logo. (For context, it was a yellow lily on a flat white background, with the word “Lily” in black underneath.) To fix it, I asked it to remove the text and got a response saying all text was already removed. So instead I asked whether it saw the black area underneath the flower, and it confirmed that it did. Then I asked it to remove that so the image would only be a yellow flower and the white background, with “no black thing” underneath. For me that worked.

If you are not satisfied with the outcome you get from trying that though (for me it gave me a different image at first), you can have it create something and just pay someone to edit out whatever you want changed using Photoshop. It’d still be a lot cheaper than having them make your whole picture from scratch.

As a bit of constructive criticism to you though, it seems like you have very high expectations for a very new piece of technology. Yes, the company says it’s amazing; all companies do that with their products. You have to look at the product and its reviews for yourself from a more objective standpoint because every company thinks their product is great. It’s important to understand tools and their limitations, as well as how to use them effectively so you don’t end up using a drill to hammer a nail. I hope you find a good resolution to your problem :slight_smile:

Thank you for highlighting this for others. Note, I’m not an OpenAI employee. As I have a need for generative images, your reminder prompted me to visit the page.

I also discovered that there’s an OpenAI blog post about DALL-E 3: DALL·E 3 (openai.com), which includes a link to the research paper: DALL·E 3 Research Paper (PDF).

Since I regularly read research papers, I’ll be diving into this one in detail over the next few days. For now, I’ve just given it a quick look.

I wouldn’t rush to that conclusion. As I mentioned, I plan to read the research paper and combine the insights I gain from it with constructive feedback from others who have used DALL-E 3. My goal is to improve my effectiveness in creating the images I need. Essentially, the paper will serve as a source of inspiration for optimizing my use of DALL-E 3.

As you can see in the screenshot, the larger problem became the unwillingness (if you can assign “will” to software I suppose) on DALL-E’s part to follow very explicit instructions to do or not do something.

As to my expectations, I hear you. However, when the 2nd most valuable company on Earth bulldozes $13B in one round into a company, making it half as valuable as LinkedIn, which had been around years longer when MS bought it, and had proven revenue streams, hundreds of millions of users, etc., I expect it to be at LEAST half as good.