DALLE and ChatGPT absolutely SUCK at following specific instructions when it comes to image generation.
As you can observe here, both DALLE and ChatGPT are horrendous at creating images while abiding by specific user instructions. I also experimented with this prompt:
// Return the user the final prompt, seed, and gen_id of each created image.
Use dalle to create 10 hd images with these specifications:
size (of each image): 1024x1024,
context: 2D,
prompt:
// 1. When creating the next image, always use the seed and gen_id of the previous image from the referenced_image_ids field in the dalle payload for consistency.
// 2. Each of the 10 images will be part of a 2D gif animation when combined by the user.
// 3. Each image must only contain 1 character. DO NOT include movement OR multiple characters in a single image. Motion transitions must be achieved solely via iteration of the 10 images by the user after receiving all images.
// 4. As the 10 images transition from one another, they must depict a continuous transition of motions of a baseball player throwing a baseball.
// 5. As each of the 10 images transition from the previous image, make visible position changes of the baseball within each sprite's bounding box. This DOES NOT mean you should include multiple baseballs within one image. Movement is always represented by transitions between images.
// 6. Since you can only generate a limited number of images in a single response, generate the 10 images sequentially - maximum number of images per response - such that when the images are combined, they represent a continuous animation. Continue without asking.
which also did not work. It’s almost as if dalle is ignoring my instructions.
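For comparison, here is roughly what the same request looks like when issued through the public Images API (a sketch only; the frame_prompts wording is my own, not the prompt above). As far as I can tell, the endpoint accepts prompt, model, size, quality, and n, but nothing like seed, gen_id, or referenced_image_ids, so there is no supported way to pin frame N to the look of frame N-1:

```python
# Rough sketch of the prompt above, issued through the public Images API.
# Note: seed, gen_id, and referenced_image_ids only appear in the internal
# dalle tool payload that ChatGPT uses; the public endpoint (as far as I can
# tell) has no equivalent, so frame-to-frame consistency is not enforceable.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

frame_prompts = [
    f"Frame {i + 1} of 10: a single 2D baseball player mid-throw, "
    f"pose at {i * 10}% of the throwing motion, flat colors, no motion blur"
    for i in range(10)
]

urls = []
for p in frame_prompts:
    result = client.images.generate(
        model="dall-e-3",
        prompt=p,
        size="1024x1024",
        quality="hd",
        n=1,  # dall-e-3 only allows one image per request
    )
    urls.append(result.data[0].url)

print("\n".join(urls))
```

Each call above is independent, which is exactly the problem: nothing ties image 2 to image 1 except whatever overlap the text prompts happen to produce.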
To this end, I am hell-bent on implementing a solution that allows users to edit images in a specific way using dalle or even some other stable diffusion method.
First and foremost, I now know that ChatGPT in the browser interface has advanced access to dalle such that it can retrieve some textual information about the image as well as the image itself (gen_id, seed, its “knowledge” of the image). However, my attempts to replicate this behavior using the API have been futile. I even tried the gpt-vision model, but the fundamental problem is that gpt-vision cannot “share” what it knows about an input image with dalle for image generation beyond a textual description of the image.
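For concreteness, this is the kind of pipeline I mean (a rough sketch; the model names and the describe-then-regenerate structure reflect my experiment, not any official recipe). Everything dalle receives is plain text, so the identity of the original image is lost:

```python
# Sketch of the describe->regenerate pipeline I tried: GPT-4 with vision
# describes the input image, and that text description is all dalle ever sees.
from openai import OpenAI

client = OpenAI()

def describe_image(image_url: str) -> str:
    """Ask the vision model for a description detailed enough to regenerate from."""
    response = client.chat.completions.create(
        model="gpt-4-vision-preview",
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "Describe this image in exhaustive visual detail "
                         "so an image generator could recreate it."},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
        max_tokens=500,
    )
    return response.choices[0].message.content

def regenerate_with_edit(image_url: str, edit_instruction: str) -> str:
    """Regenerate an 'edited' version purely from text, losing pixel-level identity."""
    description = describe_image(image_url)
    result = client.images.generate(
        model="dall-e-3",
        prompt=f"{description}\n\nNow apply this change: {edit_instruction}",
        size="1024x1024",
    )
    return result.data[0].url
```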
Research Question 0: In the browser ChatGPT interface, how does ChatGPT “share” knowledge between the browser tool, the dalle tool, and the python tool? Does it use RAG internally? If one were to replicate the same behavior using a combination of the OpenAI API and some other tools, would RAG be the correct approach for making different models (e.g. gpt-vision, gpt-4-turbo, dalle) share their knowledge to produce a more seamless response?
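To make the RAG idea concrete, here is a minimal sketch of what a shared memory between the tools could look like, using an in-memory store and cosine similarity (purely hypothetical. I have no idea whether ChatGPT actually does anything like this internally):

```python
# Hypothetical RAG-style "shared memory": every tool step writes what it
# learned into one store, and the next prompt is assembled from what the
# store returns for a query.
import numpy as np
from openai import OpenAI

client = OpenAI()
memory: list[tuple[str, np.ndarray]] = []  # (note text, embedding) pairs

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return np.array(resp.data[0].embedding)

def remember(note: str) -> None:
    """Called after each tool step, e.g. with gpt-vision's description of frame N."""
    memory.append((note, embed(note)))

def recall(query: str, k: int = 3) -> list[str]:
    """Pull the k most relevant notes to prepend to the next dalle prompt."""
    q = embed(query)
    def score(item):
        _, v = item
        return float(np.dot(v, q) / (np.linalg.norm(v) * np.linalg.norm(q)))
    ranked = sorted(memory, key=score, reverse=True)
    return [note for note, _ in ranked[:k]]

# Usage:
# remember("Frame 3: pitcher's arm cocked back, ball near right shoulder")
# context = recall("what did the previous frame look like?")
```

In production this would presumably live in something like Pinecone (linked below) rather than a Python list, but the flow is the same: write after every model call, read before every model call.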
Research Question 1: For my objective of being able to make specific edits, specific creations, or specific variations of an input image, should I just ditch dalle entirely and look at Stable Diffusion models on Hugging Face? Has anyone used onejourney? Why does dalle so fundamentally suck at understanding simple instructions and generating output accordingly?
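For the Stable Diffusion route, something like the following (a sketch using Hugging Face diffusers; the model id and strength value are just illustrative choices) shows the kind of control dalle's API does not expose: a fixed seed plus img2img on the previous frame:

```python
# Sketch of the Stable Diffusion alternative via diffusers: you control the
# seed directly and can feed the previous frame back in as the init image.
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

previous_frame = Image.open("frame_01.png").convert("RGB")
generator = torch.Generator("cuda").manual_seed(42)  # fixed seed keeps the character stable

next_frame = pipe(
    prompt="2D baseball player, arm moving forward in the throwing motion, same character",
    image=previous_frame,
    strength=0.45,       # low strength preserves most of the previous frame
    guidance_scale=7.5,
    generator=generator,
).images[0]

next_frame.save("frame_02.png")
```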
Here are some useful links I found:
Open AI Retro # Credits to icdev2dev; thread-based potential solution (?)
Pinecone Vector Database # RAG DB
AutoChat # Selenium based potential solution
Medium article for Selenium+ChatGPT # Selenium based potential solution
Unofficial ChatGPT browser API # Selenium based potential solution?
Unofficial OpenAI API # Selenium based potential solution?
Medium article for RAG+image gen # RAG based potential solution
Amazon article for RAG+image gen # RAG based potential solution
I am working on this religiously, and I will post constant updates. I really appreciate the community’s help.