GPT-4 Images - techniques for determining whether a prompt requires an image or a text-based response

Hi Everyone,

GPT-4’s web interface seems to let users submit normal chat prompts that return text-based responses as well as image prompts that return images, without having to switch models (i.e. explicitly specifying DALL·E). This behavior is great, as users don’t have to think about which model to pick when moving between text and image requests.

When looking into the API, however, it appears that GPT-4 requests can’t return images and that DALL·E has to be specified.

Is GPT-4 in the web interface actually returning images itself (meaning the API is just not updated yet), OR is the interface somehow understanding what the user is asking and calling DALL·E instead when image prompts are entered?

If the web interface is “switching” models, does anyone know how that technique can be replicated? Explicit commands (like /imagine) are obviously easy to handle, but I would like to replicate the smoothness of the web interface, if that makes sense.

Hi and welcome to the Forum!

ChatGPT and the APIs are two separate products.

ChatGPT integrates multiple different capabilities, as you rightly pointed out.
In an API context, these capabilities are handled through different endpoints (i.e. the chat completions endpoint, the image generation endpoint, etc.), which need to be called separately depending on the request.

If you are looking to create an interface that replicates the ChatGPT experience, then you’d have to implement logic that identifies the intent of a user request. Based on the identified intent, you would then call the appropriate API in the backend and return the output to the user.
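
For reference, calling the two endpoints separately with the OpenAI Python SDK looks roughly like this. It's just a minimal sketch; the model names gpt-4 and dall-e-3 are assumptions and depend on what you have access to:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Chat completions endpoint: returns a text response
chat = client.chat.completions.create(
    model="gpt-4",  # assumed model name
    messages=[{"role": "user", "content": "Explain how tides work."}],
)
print(chat.choices[0].message.content)

# Image generation endpoint: returns an image URL
image = client.images.generate(
    model="dall-e-3",  # assumed model name
    prompt="A watercolor painting of a lighthouse at dusk",
    n=1,
    size="1024x1024",
)
print(image.data[0].url)
```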

Thank you for that explanation. It completely makes sense. I suspect simply calling GPT-4 first to determine the best model, and then submitting the request to the appropriate model, would be the most versatile option.

I tested this just now and GPT-4 seems to know which model is best to run and returns it to me. I just need to tune it so I always get an expected response.

Well, you don’t want to leave the specific model choice to GPT-4. It’s not “aware” of all the latest models available due to the training cut-off date, so it might hallucinate a response in this regard.

Hence the recommendation to identify the intent and then, based on the identified intent, have logic in place for which specific endpoint to call. Perhaps this is what you meant and I just misunderstood.

Identifying intent was what I was getting at; sorry if that wasn’t clear. If the only other model I need to use is a “DALLE” model, GPT seems to do a really good job of recognizing when it should be used (in quick testing). So basically, if I instruct GPT to identify the intent and limit its responses to either “GPT” or “DALLE”, the intent response can be extracted and used to select the appropriate model to call in the API.

Essentially the method requires two responses to a user’s input: one to identify intent (at which point the program selects the appropriate model), and another to actually respond.
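
To make that concrete, here is a minimal sketch of the two-call flow. The model names, the classification prompt, and the fallback to a text response on an unexpected label are all assumptions to tune for your own setup:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

VALID_LABELS = {"GPT", "DALLE"}

def pick_model(user_input: str) -> str:
    """First call: ask the chat model to label the request as GPT or DALLE."""
    resp = client.chat.completions.create(
        model="gpt-4",  # assumed model name
        temperature=0,
        messages=[
            {"role": "system",
             "content": "Reply with exactly one word, GPT or DALLE, indicating "
                        "which model should handle the user's request."},
            {"role": "user", "content": user_input},
        ],
    )
    label = resp.choices[0].message.content.strip().upper()
    return label if label in VALID_LABELS else "GPT"  # fall back to text on anything unexpected

def respond(user_input: str) -> str:
    """Second call: route the request to the endpoint selected above."""
    if pick_model(user_input) == "DALLE":
        image = client.images.generate(
            model="dall-e-3",  # assumed model name
            prompt=user_input,
            n=1,
            size="1024x1024",
        )
        return image.data[0].url
    chat = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": user_input}],
    )
    return chat.choices[0].message.content
```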

Yes, as a general logic this makes sense.
