Discussion of a fundamental solution to a fundamental problem with dalle

DALLE and ChatGPT absolutely SUCK at following specific instructions when it comes to image generation.




As you can observe here, both DALLE and ChatGPT are horrendous at creating images while abiding by specific user instructions. I also experimented with this prompt:

// Return the user the final prompt, seed, and gen_id of each created image.

Use dalle to create 10 hd images with these specifications: 

size (of each image): 1024x1024,
context: 2D,
prompt: 

// 1. When creating the next image, always use the seed and gen_id of the previous image from the referenced_image_ids field in the dalle payload for consistency.

// 2. Each of the 10 images will be part of a 2D gif animation when combined by the user.

// 3. Each image must only contain 1 character. DO NOT include movement OR multiple characters in a single image. Motion transitions must be achieved solely via iteration of the 10 images by the user after receiving all images.

// 4. As the 10 images transition from one another, they must depict a continuous transition of motions of a baseball player throwing a baseball. 

// 5. As each of the 10 images transition from the previous image, make visible position changes of the baseball within each sprite's bounding box. This DOES NOT mean you should include multiple baseballs within one image. Movement is always represented by transitions between images.

// 6. Since you can only generate a limited number of images in a single response, generate the 10 images sequentially - maximum number of images per response - such that when the images are combined, they represent a continuous animation. Continue without asking.

which also did not work. It’s almost as if dalle is ignoring my instructions.
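For anyone who wants to reproduce this outside the browser, here is a minimal sketch of the same experiment against the public Images API (assuming the openai Python SDK and Pillow; the prompt text is just an illustrative placeholder). Note that seed, gen_id, and referenced_image_ids are ChatGPT-internal dalle tool fields; the public API does not expose them, so any cross-frame consistency has to come from the prompt text alone:

```python
# Minimal sketch, assuming the openai Python SDK (>=1.x) and Pillow.
# seed / gen_id / referenced_image_ids are not available here.
import base64
from io import BytesIO

from openai import OpenAI
from PIL import Image

client = OpenAI()  # reads OPENAI_API_KEY from the environment

BASE_PROMPT = (
    "2D sprite of a single baseball player mid-throw, frame {i} of 10, "
    "same character design, same camera angle, plain background"
)

frames = []
for i in range(1, 11):
    # dall-e-3 only accepts n=1, so the 10 frames are requested sequentially.
    resp = client.images.generate(
        model="dall-e-3",
        prompt=BASE_PROMPT.format(i=i),
        size="1024x1024",
        quality="hd",
        n=1,
        response_format="b64_json",
    )
    frames.append(Image.open(BytesIO(base64.b64decode(resp.data[0].b64_json))))

# Combine the frames into the gif that the original prompt describes.
frames[0].save("throw.gif", save_all=True, append_images=frames[1:],
               duration=120, loop=0)
```

In my experience the frames generated this way drift badly in style and character design, which is exactly the problem described above.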

To this end, I am hell-bent on implementing a solution that lets users edit images in a specific way using dalle, or even some other method such as Stable Diffusion.

First and foremost, I now know that ChatGPT in the browser interface has privileged access to dalle: it can retrieve textual information about each generated image (gen_id, seed, its “knowledge” of the image) as well as the image itself. However, my attempts to replicate this behavior using the API have been futile. I even tried the gpt-vision model, but the fundamental problem is that gpt-vision cannot “share” what it knows about an input image with dalle for image generation, other than a textual description of the image.
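To make that limitation concrete, here is a minimal sketch of that kind of vision-to-dalle pipeline (assuming the openai Python SDK; gpt-4o is just a stand-in for whichever vision-capable model is available). The only thing dalle ever receives is the description string:

```python
# Minimal sketch: a vision-capable chat model describes the input image,
# and that text is all that dalle gets. Model names are assumptions.
import base64
from openai import OpenAI

client = OpenAI()

def describe(image_path: str) -> str:
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    resp = client.chat.completions.create(
        model="gpt-4o",  # assumed vision-capable model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe this image in enough detail to reproduce it."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content

def regenerate(image_path: str) -> str:
    # dalle never sees the original pixels, only this description string.
    description = describe(image_path)
    resp = client.images.generate(model="dall-e-3", prompt=description,
                                  size="1024x1024")
    return resp.data[0].url
```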

Research Question 0: In the browser ChatGPT interface, how does it “share” knowledge between the browser tool, dalle tool, and python tool? Does it use RAG internally? If one were to replicate the same behavior using the OpenAI API in conjunction with other tools, would RAG be the correct approach for making different models (e.g. gpt-vision, gpt-4-turbo, dalle) share their knowledge and produce a more seamless response?
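To make Research Question 0 more concrete, here is a rough sketch of what a RAG-style shared memory could look like, under the assumption that “sharing knowledge” just means a common store that every tool reads from and writes to. The in-memory list below stands in for Pinecone or any other vector database:

```python
# Rough sketch of a shared RAG-style memory (assumptions: openai SDK, numpy).
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.array(resp.data[0].embedding)

memory = []  # list of (embedding, record) pairs shared by all "tools"

def remember(text: str, **metadata):
    # e.g. remember(description, gen_id="...", seed=12345) after each generation
    memory.append((embed(text), {"text": text, **metadata}))

def recall(query: str, k: int = 3):
    q = embed(query)
    scored = sorted(
        memory,
        key=lambda item: float(np.dot(q, item[0]) /
                               (np.linalg.norm(q) * np.linalg.norm(item[0]))),
        reverse=True,
    )
    return [record for _, record in scored[:k]]

# Before generating frame N, recall what frame N-1 looked like and prepend it
# to the prompt, so the image model at least sees a textual "memory".
```

Whether this matches what the browser interface actually does internally is exactly what the question asks; the sketch only shows what the approach would look like on the API side.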

Research Question 1: For my objective of making specific edits, specific creations, or specific variations of an input image, should I just ditch dalle entirely and look at Stable Diffusion models on Hugging Face? Has anyone used onejourney? Why does dalle so fundamentally suck at understanding simple instructions and generating output accordingly?
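For Research Question 1, here is a rough sketch of the Stable Diffusion route (assuming the diffusers library and a CUDA GPU; the checkpoint name is just an example). Unlike dalle, the seed is fully under your control here, which is most of what the animation experiment above needs:

```python
# Sketch of the Stable Diffusion route, assuming diffusers + a CUDA GPU.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # any SD checkpoint works
    torch_dtype=torch.float16,
).to("cuda")

frames = []
for i in range(10):
    generator = torch.Generator("cuda").manual_seed(42)  # identical seed per frame
    image = pipe(
        f"2D sprite of a baseball player throwing a ball, pose {i + 1} of 10",
        generator=generator,
        num_inference_steps=30,
    ).images[0]
    frames.append(image)
    image.save(f"frame_{i:02d}.png")
```

Img2img or ControlNet pipelines from the same library could take this further by conditioning each frame on the previous one, which is closer to what the referenced_image_ids field seems to do inside ChatGPT.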

Here are some useful links I found:

Open AI Retro # Credits to icdev2dev; thread-based potential solution (?)

RAG Introduction

RAG Application

Pinecone Vector Database # RAG DB

AutoChat # Selenium based potential solution (rough sketch after this list)

Medium article for Selenium+ChatGPT # Selenium based potential solution

Unofficial ChatGPT browser API # Selenium based potential solution?

Unofficial OpenAI API # Selenium based potential solution?

Medium article for RAG+image gen # RAG based potential solution

Amazon article for RAG+image gen # RAG based potential solution
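Regarding the Selenium-based links above, here is a very rough outline of what that approach looks like (assuming the selenium package; the selector and waits are hypothetical, since the ChatGPT web UI changes often and requires a login). Treat it as an outline, not working automation:

```python
# Hypothetical outline of driving the ChatGPT web UI with Selenium.
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome()
driver.get("https://chat.openai.com/")
input("Log in manually in the opened browser, then press Enter here...")

# The CSS selector below is a placeholder; inspect the current UI to find it.
prompt_box = driver.find_element(By.CSS_SELECTOR, "#prompt-textarea")
prompt_box.send_keys("Use dalle to create frame 1 of 10 ...", Keys.ENTER)

time.sleep(60)  # crude wait for the image to render
for img in driver.find_elements(By.TAG_NAME, "img"):
    print(img.get_attribute("src"))  # harvest generated image URLs

driver.quit()
```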

I am working on this religiously and will post regular updates. I really appreciate the community’s help.

It is rewriting the instructions that you give it, which is why it seems to be ignoring them. Speculating about why OpenAI does this is not very helpful; others have asked, to no avail.

You certainly have other options. However, dall-e-3 is best in class, which is presumably why they can get away with it.

But its revised prompts, both in the API and in the browser interface, are not too different from my original instructions; more often than not, they are more detailed. It seems that dalle simply fails to follow the specific instructions when creating an image.

Yes, you are correct. But since it is not following the exact instructions, it becomes frustrating.


I have run into the same issue. I wanted to make my cousin look like he’s riding a helicopter, but no matter how many prompts I try and how many pictures I add, it never looks like him; it always looks like a model. It’s as if they purposely make the person 10x more attractive.

Hi @utilityfog, thanks for your research. Do we have any solution for this so far?