How to generate an image and text at the same time via the API? Thanks

When using GPT-4V in the web app, for example, we can ask it to create an image around a topic and also generate a text explanation of the topic. Is it possible to do the same thing via the API? The documentation says this is not possible: we can only use DALL-E 3 to create an image, and then use GPT-3.5 or GPT-4 to interpret the image. Any comments are highly appreciated.


DALL-E and GPT-4 are two completely separate models (as far as I know). ChatGPT rewrites your prompt and then calls DALL-E on the rewritten prompt.

You could maybe accomplish something similar using the Assistants API, but you might be better served by running your own chains :thinking:

Hi, thank you. What I meant was to use GPT-4V to generate an image and text at the same time, not to use DALL-E. What I want to achieve is to create an image and an explanation of that image. In the web app we can do this with a single prompt, e.g., "create an image about a teacher and interpret the image". I am wondering how we can do this via the API. One approach, of course, is to use DALL-E to generate an image and then use GPT-4 to interpret it, separately. I want to know whether GPT-4V can do the two tasks together.

gpt-4-vision-preview can’t generate images, if that’s what you’re really asking :thinking:

Okay. So that means that when we use the web app to create an image and a text explanation, it actually calls two models: DALL-E to create the image, and then GPT (3.5 or 4, whichever) to write the text explanation?

To be honest, I don’t know whether ChatGPT actually calls vision when you ask it to describe what it just generated. It might just describe the text description (the DALL-E prompt) that it generated.

Thanks a lot. It would be great if OpenAI published which models each web-app call uses. (Sometimes we can tell by analysing the web app.)

Hi, have you figured this out? I have the same question: how can I generate a text explanation of generated images?

The original question can be achieved with function calling, using DALL-E for image creation and GPT-4V for the image analysis.

However, if the image is generated by DALL-E 3, you probably do not need to call GPT-4V at all if the information in the revised_prompt field of the output is sufficient for your needs, since it already describes what was created.

For example, you send a prompt, “create an image of a person looking at the cherry blossoms”, to chat completions with function calling. Your function will then be invoked and passed the prompt. However, DALL-E 3 will add more detail to your prompt when it generates the image, and will send the expanded version back via “revised_prompt” in the output.
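A minimal sketch of the image-generation part of that flow, using the REST endpoint directly with only the standard library. The endpoint and field names follow the public API reference (POST /v1/images/generations, with revised_prompt in the response), but treat this as an outline rather than a drop-in implementation; it assumes OPENAI_API_KEY is set in the environment.

```python
# Sketch: call DALL-E 3 and read back "revised_prompt" (stdlib only).
# Assumes OPENAI_API_KEY is set; not production-ready error handling.
import json
import os
import urllib.request


def build_image_request(prompt: str) -> dict:
    # Request body for POST https://api.openai.com/v1/images/generations
    return {"model": "dall-e-3", "prompt": prompt, "n": 1, "size": "1024x1024"}


def generate_image(prompt: str) -> tuple[str, str]:
    """Returns (image_url, revised_prompt). DALL-E 3 expands the prompt
    before generating and sends the expansion back in revised_prompt."""
    req = urllib.request.Request(
        "https://api.openai.com/v1/images/generations",
        data=json.dumps(build_image_request(prompt)).encode(),
        headers={
            "Authorization": f"Bearer {os.environ.get('OPENAI_API_KEY', '')}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        data = json.loads(resp.read())
    return data["data"][0]["url"], data["data"][0]["revised_prompt"]


# Usage (requires a valid API key and network access):
# url, revised = generate_image(
#     "create an image of a person looking at the cherry blossoms")
# print(revised)  # often descriptive enough on its own
```

In a function-calling setup, generate_image would be the handler your chat-completions tool invokes with the prompt the model chose.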

If “revised_prompt” is not enough, you can still send the image to GPT-4V for analysis within the same function code block, then send the result back to the original chat completions call for a summary.
