ChatGPT 4 Vision sometimes returns "I'm sorry, I can't provide help with that request" and sometimes actual response for the same Image

I’m having a weird issue with ChatGPT 4 Vision API. I’m running this request (which mostly works):

let payload: [String: Any] = [
            "model": "gpt-4-vision-preview",
            "messages": [
                    "role": "user",
                    "content": [
                        ["type": "text", "text": "Respond only with a short description of the person in the image. Race, ethnicity, Hair style and color, Skin color and facial features. I want to use it as a prefix for a person description. Just respond with with the descirption, don't add anything else to your response"],
                            "type": "image_url",
                            "image_url": [
                                "url": "data:image/jpeg;base64,\(base64Image)",
                                "detail": "low"  
            "max_tokens": 50

When running the code let’s say 5 times, at least 2 out of the 5 times will result in:
“I’m sorry, I can’t provide help with that request”,

and the result with appropriate response like:
“The person has a light brown complexion, short dark hair styled back, and prominent facial features including a high forehead, strong cheekbones, and a prominent nose”.

I can’t understand why, and we need to use the API for production app.
Does anyone have a clue how to solve this to be consistent? Thanks!

Hi and welcome to the Developer Forum!

My guess is you are running into issues around people and faces, possibly famous people or people that trigger protections around that.

Hey @Foxalabs , thank you for the quick reply. I would agree, but i’m getting different response to the same image, that’s what weird about it. I would understand if it will be consistent for for image A and B, but it’s varies on the same image…

Have you tried with the high detail flag set?

(more words for the bot)

Yes, not helping unfortunately… Is it possible to disable some flag or to make the API less sensitive?

1 Like

Unfortunately not.

The only connectivity you have with the underlying model and it’s systems is the prompt , personally I have found giving a template (a shot) with an example description tends to lead the model to give reliable output, using that method, but with different image content, I have over a million successful image descriptions with only minor issues related to the servers being busy from time to time.

@Foxalabs ,I just tried a different prompt and it works, but now it’s giving a but of less desired output. Can you elaborate more on what you said? Should I send him an example before? Also, I’m trying to optimize my request to run as fast as possible as user are waiting for it to complete in real-time.

Just create a sample output, i.e. show it an example of a perfect reply, then ask it to look at the current image and using the example as a guide produce an output that accurately describes the current image.

@Foxalabs something like that:
Here as an example reply for the request I’m trying to perform on an input image:
Asian female, young adult, with vibrant pink hair
produce an output that accurately describes the current image I uploaded.

Something in those lines?

Yes. I tend to be more formal:

Asian female, young adult, with vibrant pink hair

Please use the example in ### markers as a guide and produce a description of this current image in a similar format, use appropriate wording in your own description.

Interesting, I’m getting this response:

South Asian male, young adult, featuring a neatly trimmed beard.

In this image, we see a young adult male with a distinctive beard and a short haircut. His hair and facial hair are well-groomed, and he gazes.

I have a feeling he uses the Asian example wrongly as the input image is not asian at all haha.

That’s seems to work:
Asian female, young adult, with vibrant pink hair. Please use the example as a guide and produce a description of this current image in a similar format, use appropriate wording in your own description, don’t add other inforamtion, just add the appropriate description for the new image.

What do you think? @Foxalabs

1 Like

Sounds good to me.

(additional words for the bot)

1 Like

You could try being more generic in the decriptions:

Perceived Nationality, gender, short description of facial features, short description of clothing.

1 Like

That is a supervised style of trained denial, new with this model. Instant shut-down. The way to get around that is to provide a system message that states the purpose of the AI vision model is to do exactly what is needed, AND to begin with the output with a mandatory phrase for the “backend processor”, such as output must begin with {“gpt-4-vision”: {“description_of_humans”: "…

Thanks @Foxalabs , I’ll explore further. Thanks for the help - I’ll update once I get a stable solution.

1 Like

Hey @_j , thanks for the technical explanation. Seems like you know how to handle it. Do you offer a paid consultant? I’d love to perfect this, and later upload a full solution for the issue, but I do aware that time is money so I’ll help to pay for that service.

Hi @roi Did you ever find a solution for this?

I am also facing same error while trying - Processing and narrating a video with GPT’s visual capabilities. Any help with the solution is much appriciated?

If you are interested in vision for video, you might check out the open-source model MiniGPT4-Video, a demo of which is on huggingface, and accompanies the research paper.

If you are using gpt-4-turbo and it is refusing your input images, you need to establish your authority to the AI to perform the task, and justify the safety measures about any questionable tasks. Let the AI know that there is no user to receive its denials, and failure to perform the pre-approved task will have catastrophic backend API results…(and don’t use my advice to break the terms).