DALL-E API to generate JSON data from an image

I’m working on a project that, at some point, needs to extract metadata from a user-supplied image. After reviewing the DALL-E API documentation, it seems like my goal might not be achievable. However, this is kind of odd, because ChatGPT allows for image uploads and provides context about the image.

Am I looking in the wrong place? I believe this feature should be available in the API, just as it is in ChatGPT itself. Can anyone clarify this for me?

I think what you need is GPT-4 Vision. But there is no API yet.


Isn’t that weird? The chat interface does let you upload an image and get specific data about it back. Do you know any alternatives for my use case?

Check this post about the metadata contained in the response returned by GPT-4 Vision for your reference.

But since there is no API yet, we can’t know the exact details for sure.

I have been diving into this problem all week and just came across this talk on YouTube from OpenAI engineers (posted 2 days ago).

Step 1:
Get a description from an image (this is the exact problem I’m facing; I don’t know how they do it).

Step 2:
Ask GPT-4 to turn that description into a DALL-E prompt for generating a new image in a certain style.

Step 3:
Compare both images, and generate a new prompt that accounts for the differences between them.

Step 4:
Use that newly created prompt to create the final image.

What I want to know

How are they doing step 1? I’m struggling to figure out how to get a relevant description from an image I feed in. (I’ve sketched below, after the link, how I imagine the later steps chaining together once step 1 is solved.)

I have added the timestamp of where it begins:
https://www.youtube.com/live/veShHxQYPzo?si=4msqcMAvwYzKOOKL&t=4775
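
For context, here is a rough, untested sketch of how I imagine steps 2–4 chaining together, assuming the pre-1.0 `openai` Python package. Step 1 is stubbed out, because that is exactly the piece I’m missing:

```python
import openai

openai.api_key = "sk-..."  # your API key


def describe_image(image_path: str) -> str:
    """Step 1 (the missing piece): image -> description.
    There is no OpenAI API for this yet, so for now this needs a local
    captioning model or a hand-written description."""
    raise NotImplementedError("image -> description is the missing step")


def restyle_prompt(description: str, style: str) -> str:
    """Step 2: ask GPT-4 to rewrite the description as a DALL-E prompt."""
    resp = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": (
                f"Rewrite this image description as a DALL-E prompt "
                f"in a {style} style:\n{description}"
            ),
        }],
    )
    return resp["choices"][0]["message"]["content"]


def generate_image(prompt: str) -> str:
    """Steps 3/4: generate an image and return its URL.
    (Comparing the two images in step 3 also needs image input,
    so it is blocked on the same missing capability as step 1.)"""
    resp = openai.Image.create(prompt=prompt, n=1, size="1024x1024")
    return resp["data"][0]["url"]


# Usage, once describe_image is filled in:
# description = describe_image("input.jpg")
# print(generate_image(restyle_prompt(description, "watercolor")))
```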

The functionality to do step 1 via the API has not been released yet. It will be, but there are no official timescales for that. You can do it via ChatGPT Plus with image input, so it will be that feature that gets hooked up to the API.

Thanks for clarifying, I came to the same conclusion just a bit later.

Is it normal that features drop later for developers? I feel like it should be the other way around, but that is probably just me.

I wrote a script that does step 1 a while back (and prints the descriptions to a CSV). If you’re interested, you can find it here:
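In short, it uses BLIP via the Hugging Face transformers library. A minimal sketch of the approach (folder paths, model checkpoint, and the token budget here are placeholders; the full script is in the link):

```python
import csv
from pathlib import Path

from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

# BLIP base captioning model from the Hugging Face hub
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base"
)


def caption(image_path: Path) -> str:
    """Generate a short description for one image."""
    image = Image.open(image_path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=40)
    return processor.decode(out[0], skip_special_tokens=True)


# Caption every image in a folder and write the results to a CSV.
with open("descriptions.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["file", "description"])
    for path in sorted(Path("images").glob("*.jpg")):
        writer.writerow([path.name, caption(path)])
```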


I’ll put that BLIP on my RADAR (Realtime Augmentation Describing Artistic Renderings)


I should probably mention that it tends to hallucinate… a lot :sweat_smile:

If you’re interested in something that can do a bit better, I’d recommend this:

It’s a bit more memory-intensive, but there’s a Colab notebook available if you just want to try it out.
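
For a sense of how these heavier captioners run, here’s BLIP-2 via transformers as an example (an illustrative model choice on my part, not necessarily the exact repo linked above; it wants a reasonably large GPU):

```python
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

# Load BLIP-2 in half precision and move it to the GPU.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16
)
model.to("cuda")

# Caption a single image.
image = Image.open("photo.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt").to("cuda", torch.float16)
out = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(out[0], skip_special_tokens=True).strip())
```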


Thank you for your suggestion. I will dive into the GitHub repo you sent me! Just one thing: I’m somewhat new to this world and want to try to implement this in my project. Where should I start learning? Are there tutorials out there that can get me up to speed?

Always happy to help!

My best advice on how to get started is: just do it. Use git clone [URL] to clone the repo, get it up and running, and finally import the relevant bits into your project :laughing:

You can ask ChatGPT to explain any errors you run into along the way; it is usually pretty good at that.
