How do I replicate the browser-interface DALL-E 3 GPT behavior using the API?

As you can see here, when using the browser-interface DALL-E 3 GPT to prompt an image from DALL-E, I get both the image and a text response about the image (such as its gen_id).

I would like to replicate this behavior using the OpenAI API, but from what I have experimented with, the generations endpoint only returns an images response object, which contains only a list of image objects and created: int. I tried using the chat.completions API to indirectly call the DALL-E generations API and then somehow include a text response about the generated image (such as gen_id), but this is proving very difficult, and ChatGPT either refuses to provide a good answer or cannot find one. Has anyone succeeded at this? In essence, I am aiming to implement an API equivalent of the browser-interface DALL-E GPT behavior.


ChatGPT has more advanced access to DALL-E than the API does, so things like gen_id and referenced_image_ids aren't available via the API, as is evident from the API reference.


Then is there at least a way to replicate that advanced access to DALL-E using the Assistants API? I really need a reliable way to extract a gen_id for each image.

That’s relatively straightforward, I would think.

Essentially, create a specialized thread with specialized messages. Something along these lines:

from openai import OpenAI
client = OpenAI()

text_prompt = "a white siamese cat"
sha256_text_prompt = ...

response = client.images.generate(
    model="dall-e-3",
    prompt=text_prompt,
    size="1024x1024",
    quality="standard",
    n=1,
)

image_url = response.data[0].url

sha256_sum_image = ...

Then insert a specialized message (e.g. containing the hashes and the URL) into the specialized thread.

Remember to download the image within 60 minutes, since the generated URL expires.
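
The download step itself can be as simple as this (a minimal sketch using requests; how you build the save path is up to you):

import requests

def download_image(image_url, save_path):
    # Generated image URLs expire after roughly an hour, so persist the bytes locally.
    resp = requests.get(image_url, timeout=60)
    resp.raise_for_status()
    with open(save_path, "wb") as f:
        f.write(resp.content)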

Not at my computer right now. Let me see if I can whip up the working code.

import json
import os

import requests

# Assumes api_key, working_dir, extract_image_name() and download_image()
# are defined elsewhere in the module.

def generate_spritesheet_unique_with_requests(prompt, referenced_image_ids=None):
    base_url = "https://api.openai.com/v1/images/generations"
    # Earlier attempt: passing referenced_image_ids as a query parameter,
    # e.g. f"{base_url}?referenced_image_ids=*", is rejected by the API.
    url = base_url

    headers = {
        "Content-Type": "application/json",
        "Authorization": f"Bearer {api_key}"
    }
    
    # Another failed attempt: adding a 'query' property to the payload, e.g.
    # 'query': 'gen_id',
    # "query": "query GetVariable($variableName: String!) { variable(name: $variableName) { value } }"
    # is rejected by the API:
    #
    # Failed to generate image: {
    #     "error": {
    #         "code": null,
    #         "message": "Additional properties are not allowed ('query' was unexpected)",
    #         "param": null,
    #         "type": "invalid_request_error"
    #     }
    # }

    payload = {
        "model": "dall-e-3",
        "response_format": "url",
        "quality": "hd",
        "n": 1,
        "prompt": prompt,
        # Include any additional parameters here
    }

    # Conditionally add referenced_image_ids if it's provided and not empty; does not work
    if referenced_image_ids:
        payload["referenced_image_ids"] = referenced_image_ids
    
    # Variables to store response data
    image_urls = []
    image_revised_prompts = []

    # Make the request; analogous to client.images.generate()
    response = requests.post(url, headers=headers, data=json.dumps(payload))

    if response.status_code == 200:
        data = response.json()
        # created
        created = data.get('created')
        print(f"Created Timestamp: {created}")
        # image list
        image_data = data.get('data', [])
        for index, image in enumerate(image_data):
            # Extract image url
            image_url = image.get('url')
            image_urls.append(image_url)
            print(f"URL: {image_url}")
            # Extract image name
            image_name = extract_image_name(image_url)
            print(f"image name: {image_name}")
            save_path = os.path.join(working_dir, f"OpenAI/Generated_Images/Images/{image_name}")
            download_image(image_url, save_path)
            # Extract revised prompt
            revised_prompt = image.get('revised_prompt')
            image_revised_prompts.append(revised_prompt)
            print(f"Revised Prompt: {revised_prompt}")
        # gen_id: always None, because the generations endpoint never returns
        # referenced_image_ids (or any gen_id / seed) in its response
        gen_id = data.get('referenced_image_ids')
        print(f"Generation ID: {gen_id}")
    else:
        print("Failed to generate image:", response.text)

I already have working API code here. But the problem is…

When making a request to the generations API, I only receive an images response object that has created: int and data: [Image] as its properties. Each Image in data only has url, b64_json, and revised_prompt, which I can extract with data.get.
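
In other words, this is everything the SDK exposes on that response ("a white siamese cat" is just an example prompt):

from openai import OpenAI

client = OpenAI()

response = client.images.generate(
    model="dall-e-3",
    prompt="a white siamese cat",
    response_format="url",   # or "b64_json"
    n=1,
)

print(response.created)             # int: unix timestamp
for image in response.data:
    print(image.url)                # set when response_format="url"
    print(image.b64_json)           # set when response_format="b64_json"
    print(image.revised_prompt)     # DALL-E 3's rewritten prompt
# There is no gen_id, seed, or referenced_image_ids anywhere on this object.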

My ultimate objective is to replicate the advanced access to DALL-E that the browser-based DALL-E GPT has, so that I can retrieve a gen_id as I retrieve the generated images.

I looked at the Assistants API documentation, but it seems the only tools I can enable are retrieval, code interpreter, and function calling. Function calling seems interesting, but if it works by the same mechanism chat.completions uses to call functions (i.e., the model just returns JSON arguments, from which the user manually calls the function and feeds the result back to the model), that is very depressing. I'm hoping there's more.
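
For reference, this is roughly what that round trip looks like with chat.completions (the Assistants version is analogous, going through runs and submit_tool_outputs). The generate_image tool and its wrapper are placeholders I made up, and the model name is just an example:

import json
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "generate_image",   # hypothetical wrapper, not a built-in tool
        "description": "Generate an image with DALL-E 3 and return its URL",
        "parameters": {
            "type": "object",
            "properties": {"prompt": {"type": "string"}},
            "required": ["prompt"],
        },
    },
}]

messages = [{"role": "user", "content": "Draw a white siamese cat and tell me about it"}]
first = client.chat.completions.create(model="gpt-4-turbo", messages=messages, tools=tools)
call = first.choices[0].message.tool_calls[0]

# The model only returns JSON arguments; the developer performs the actual call.
args = json.loads(call.function.arguments)
image = client.images.generate(model="dall-e-3", prompt=args["prompt"], n=1)
result = {"url": image.data[0].url, "revised_prompt": image.data[0].revised_prompt}

# Feed the result back so the model can write its text response about the image.
messages.append(first.choices[0].message)
messages.append({"role": "tool", "tool_call_id": call.id, "content": json.dumps(result)})
second = client.chat.completions.create(model="gpt-4-turbo", messages=messages, tools=tools)
print(second.choices[0].message.content)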

I don’t know why OpenAI doesn’t give API developers a straightforward way to simply extract the gen_id of a generated image upon receiving it. It seems like a no-brainer. What is the under-the-hood mechanism by which the browser-based DALL-E GPT processes a user prompt and returns both the generated images and a text response about them?

Here is what ChatGPT said about the “hypothetical mechanism” of how its advanced access to DALL-E may work (a rough API-side approximation follows the list):

  1. ChatGPT receives your prompt.
  2. ChatGPT sends the prompt to the DALL-E service.
  3. DALL-E generates the image and assigns a unique gen_id to the image internally.
  4. DALL-E service returns both the image and its gen_id to ChatGPT.
  5. ChatGPT then displays both the image and the gen_id to you in the browser interface.
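
Since step 4 is exactly the part the public API does not expose, the closest I can get is to synthesize my own gen_id on the API side, for example by hashing the (unique) result URL:

import hashlib
from openai import OpenAI

client = OpenAI()

def generate_with_pseudo_gen_id(prompt):
    # Steps 2-3: send the prompt to DALL-E, which generates the image.
    response = client.images.generate(model="dall-e-3", prompt=prompt, n=1)
    image = response.data[0]
    # Step 4 substitute: the API never returns DALL-E's internal gen_id,
    # so derive a stable pseudo-id from the result URL instead.
    pseudo_gen_id = hashlib.sha256(image.url.encode("utf-8")).hexdigest()[:16]
    # Step 5: hand back both the image and the pseudo gen_id, like the browser UI does.
    return image.url, pseudo_gen_id, image.revised_prompt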

Does ChatGPT use inherently different API endpoints from those in the API reference? If so, there must still be some mechanism it uses to extract information about a generated image, such as gen_id, because DALL-E certainly has that information. How does a developer use the API to implement the same behavior? Perhaps via the unofficial API repo?

So, have you ever successfully extracted the gen_id of an image using the API? If so, would you generously share your insight?

The skeleton of the code is here:

import hashlib
import argparse

from openai import OpenAI
client = OpenAI()

def insert_gen_id_into_thread(gen_id):
    # Placeholder: will store the gen_id -> image URL mapping (see below).
    pass

def prompt_to_gen_id(prompt, model="dall-e-3", size="1024x1024", quality="standard", n=1):
    response = client.images.generate(model=model, prompt=prompt, size=size, quality=quality, n=n)

    # The result URL is unique per invocation, so its SHA-256 serves as a gen_id.
    m = hashlib.sha256()
    m.update(response.data[0].url.encode("utf-8"))
    gen_id = m.hexdigest()

    insert_gen_id_into_thread(gen_id)
    return gen_id

def main(args):
    if args.prompt != "":
        gen_id = prompt_to_gen_id(args.prompt)
        print(gen_id)

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--prompt", default="")

    args = parser.parse_args()
    main(args)
    

python core.py --prompt "a white siamese cat"
a137b849e72807809d725fbf51fd0c26ec26202df89cb0e2147776728e57b5cb

The key insight is that we have access to the URL, and I am simply taking the SHA-256 of that URL (which is unique per invocation). The real issue is how to retrieve that URL and the data behind it later… which is what insert_gen_id_into_thread will do; currently it does nothing.
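
One way insert_gen_id_into_thread could work is to keep the gen_id -> URL mapping as plain messages in a dedicated Assistants thread (a sketch, assuming the thread id is created once and persisted somewhere; the message format is just my own convention):

from openai import OpenAI

client = OpenAI()

# One-time setup: create the bookkeeping thread and persist its id.
# thread = client.beta.threads.create()

def insert_gen_id_into_thread(thread_id, gen_id, image_url):
    # One message per generated image, holding the id -> URL mapping.
    client.beta.threads.messages.create(
        thread_id=thread_id,
        role="user",
        content=f"gen_id={gen_id} url={image_url}",
    )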

Watch this post for my next update.

  1. There is no access to a gen_id or anything seed-related in the API.
  2. Your initial image of creating a single scenario with multiple frames has nothing to do with referencing a prior image.
  3. The “code” that is above is fiction.

You can make a 3x3 pane by specification, but after numerous attempts in the past, the AI is not able to make cohesive animations, even within the same image.
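
For example, a layout-by-specification prompt looks something like this (purely illustrative; the panes still will not form a coherent animation):

from openai import OpenAI

client = OpenAI()

response = client.images.generate(
    model="dall-e-3",
    prompt=(
        "A single image laid out as a 3x3 grid of equal square panes, "
        "each pane showing the same cartoon cat in a successive frame of a jump"
    ),
    size="1024x1024",
    n=1,
)
print(response.data[0].url)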

PS: Non-subtle “inclusivity” is now baked into DALL-E… even when haram.

Pick single ethnicities in love?


I’ve been experimenting with the browser DALL-E, and like others said, even if I pass a gen_id, DALL-E sucks at following simple instructions.




I am hell-bent on solving these 2 problems:

  1. Inconsistent art style (when multiple sprite sheets must all share the same art style)

  2. DALL-E's inability to return results that match specific requirements, such as a different eye shape while keeping the rest of the sprite intact.

Follow this post for more details. I’ll be posting constant updates.

So the complete example is located here (openairetro/examples/dalle_gen_ids at main · icdev2dev/openairetro · GitHub).

The TL;DR is:

  1. Generate an id associated with the image and its corresponding prompt:
    python core.py --prompt "a black cat"

  2. Find the URL by id (see the sketch after these steps):
    python core.py --find_by_gen_id

  3. Verify that the URL has the correct image:
    through the browser

Follow the README for reproducible steps.
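
Not necessarily how the repo does it, but the lookup side can be as simple as scanning the bookkeeping thread from the earlier sketch (assuming the gen_id=… url=… message format):

from openai import OpenAI

client = OpenAI()

def find_url_by_gen_id(thread_id, gen_id):
    messages = client.beta.threads.messages.list(thread_id=thread_id, limit=100)
    for message in messages.data:
        for part in message.content:
            if part.type == "text" and gen_id in part.text.value:
                # Message format: "gen_id=<sha256 of url> url=<image url>"
                return part.text.value.split("url=", 1)[1]
    return None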

python core.py --prompt "Generate sprite sheet with 20 square frames where dimension of individual frames is 360x360. The sprite sheet is a 20 frame sequence of motions depecting a cat jumping off of a window sill. The motion transitions must be seamless."

I got


Were you able to acquire the 16-digit gen_id using core.py?

I posted a new topic here: Discussion for a fundamental solution for a fundamental problem of dalle

It touches upon a fundamental problem of DALL-E.

Not a 16-digit id, but one generated from the URL produced.

I think there was a misunderstanding. The reason I wanted a gen_id was not for me but for DALL-E, so that it can reference previously generated images and have that context when generating new ones. However, after a series of failed sprite sheet generation experiments, I have decided to slowly work on this solution:

Image Generation Solution Engineering

Whether using DALL-E or Stable Diffusion, my image generation model should be able to:

  1. Receive an image as input and include the image itself, or transformations of the image, as part of its output (e.g. a scaled-down or rotated cakeslime when the cakeslime is received, but it should not change the design of the input image).
  2. Receive RAG or fine-tuned input, such as motion vectors, and use it to generate a set of images (e.g. if I pass MotionGPT-generated motion vectors, it must be able to generate a sequence of images that contains that exact motion).
  3. Leverage DALL-E's creativity when generating images, as long as it maintains the immutable features of an image. For instance, while the invisible motion skeletons themselves must be immutable, DALL-E or SD must be able to generate high-quality changes in the appearance of the humanoid that fits the motion skeleton. Essentially, DO NOT modify features of the output image(s) that are immutable, but freely leverage DALL-E's high-quality image generation on the features that can be mutable.
  4. Understand specific user instructions about the contents of an image or set of images when creating an image. For instance, if I ask it to create a sequence of sprites that represent a baseball player throwing a ball, it must be able to correctly include just 1 ball per sprite but generate a sequence of those sprites in a way that shows both the projectile motion of the baseball and the throwing motion of the baseball player. A potential solution is to fine-tune DALL-E or an SD model on specific image sequences (such as projectile motion) and the corresponding set of queries for that image sequence. Essentially, I want to fine-tune DALL-E or some SD model with {vector embedding (for relevant queries): image sequence} dictionaries.
    • Merge multiple fine-tuned vector embedding: image sequence dictionaries in a final output. For instance, the projectile motion dictionary and the throwing motion dictionary should be synthesized by the image generator GPT in a way that captures both the projectile motion of the baseball and the continuous motion of the player lifting and throwing the ball.
  5. Receive an image as input and be able to make specific changes to it. For instance, if a user requests to change the position of a character's arm while leaving the rest of the image unchanged, it should generate exactly that image.
  6. Be able to reference an image that has already been generated, either by uploading it or via a gen_id or seed, like DALL-E.

In the meantime, I'll just manually animate DALL-E 3 generated images; a sketch of the closest existing API primitives for items 1 and 5 is below. I'll keep this topic updated! Thank you for your support thus far.
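
For reference while I work toward this, the closest existing API primitives for items 1, 5 and 6 are the image variation and edit endpoints (DALL-E 2 only; a sketch, with placeholder file names):

from openai import OpenAI

client = OpenAI()

# Item 1-ish: variations of an existing image (no prompt control, DALL-E 2 only).
variation = client.images.create_variation(
    image=open("cakeslime.png", "rb"),        # placeholder file name
    n=1,
    size="1024x1024",
)

# Item 5-ish: targeted edits inside a masked region (DALL-E 2 only).
edit = client.images.edit(
    image=open("character.png", "rb"),        # placeholder file name
    mask=open("arm_region_mask.png", "rb"),   # transparent where changes are allowed
    prompt="the same character with the arm raised",
    n=1,
    size="1024x1024",
)
print(variation.data[0].url, edit.data[0].url)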