'Your input image may contain content that is not allowed by our safety system.' -Vision API response

Hello! I used the vision model to perform image recognition and analysis tasks. I added some images to the messages. They were just pictures of lanes and did not contain anything that violates the safety policy.

    try:
        # print(messages_copy)
        response = client.chat.completions.create(
            model="gpt-4-vision-preview",
            # model="gpt-3.5-turbo-0125",
            messages=messages_copy,
            temperature=0.0001,
            max_tokens=3000,
            stream=True,
            n=1
        )
        # print(response.choices[0].message.content)
        for chunk in response:
            if chunk.choices[0].delta.content is not None:
                print(chunk.choices[0].delta.content, end="")
        messages_copy = messages.copy()
        summ = 0
        time.sleep(10)
        print("\n")
    except openai.APIConnectionError as e:
        print(f"openai.APIConnectionError  {e}")
        continue
    except openai.BadRequestError as e:
        print(f"openai.BadRequestError  {e}")
        continue

However, after sending the request, I received this reply.

openai.BadRequestError  Error code: 400 - {'error': {'message': 'Your input image may contain content that is not allowed by our safety system.', 'type': 'invalid_request_error', 'param': None, 'code': 'content_policy_violation'}}

I promise that these pictures are safe and harmless, but the system flags them as violating the safety system. Earlier I tried simple preprocessing, such as inverting the colors of the picture or changing the lane lines to other colors. Those workarounds helped at first, but after a while (for example, the next day) they stopped working. Now none of them work, so I want to ask how to solve this problem.
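The error shown above carries a machine-readable `code` field, so a batch job can tell a safety rejection apart from other 400 errors instead of only printing the exception. Below is a minimal sketch that works on the parsed error body as a plain dictionary; the helper names `error_code` and `is_content_policy_violation` are hypothetical, and it accepts both the nested `{'error': {...}}` shape from the log and the inner dictionary on its own.

```python
from typing import Any, Mapping, Optional


def error_code(body: Optional[Mapping[str, Any]]) -> Optional[str]:
    """Return the 'code' field of an API error body, nested or flat."""
    if not isinstance(body, Mapping):
        return None
    # Accept either {'error': {...}} or the inner error dictionary itself.
    inner = body.get("error", body)
    if not isinstance(inner, Mapping):
        return None
    return inner.get("code")


def is_content_policy_violation(body: Optional[Mapping[str, Any]]) -> bool:
    """True when the error body reports a safety-system rejection."""
    return error_code(body) == "content_policy_violation"


# The body reported in the error message above.
sample = {
    "error": {
        "message": "Your input image may contain content that is not "
                   "allowed by our safety system.",
        "type": "invalid_request_error",
        "param": None,
        "code": "content_policy_violation",
    }
}
print(is_content_policy_violation(sample))
```

Inside the `except openai.BadRequestError as e:` branch, this could be called as `is_content_policy_violation(e.body)`, assuming the exception exposes the parsed body as `e.body`, which is worth confirming against the installed SDK version.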

Welcome to the community @kgpgeyxhts46

What is the system message and text content apart from the images?

Here is the message

{
        "role": "system",
        "content": [
            {
                "type": "text",
                "text": "You are an impartial judge tasked with evaluating one lane image in one group. Each group "
                        "image is composed of four vertically spliced images, thus making up four rows. Each row "
                        "is the same height, so the first quarter of the image in each group is the first row, "
                        "one-quarter to one-half is the second row, and one-half to three-quarters is the third row. "
                        "The three quarters to the end is the fourth row. These group images represent independent "
                        "tasks completed by Assistant A and Assistant B, with their "
                        "responses in the second and third rows, based on the original lane image in the first row. "
                        "The fourth row is supposed to be the ground truth image against which you "
                        "should judge the generated images. For each group image, analyze the second "
                        "and third rows and select the assistant that has generated the lane image closest to what "
                        "the ground truth would be. Consider the following factors in your evaluation: 1. How well the "
                        "lane lines are continued from the original image in the first row to the generated images in "
                        "the second and third rows. 2. The accuracy of the lane lines in terms of their shape and form "
                        "– whether they are continuous, dashed, and follow the correct pattern as you would expect "
                        "from the original image. 3. Overall consistency and coherence of the lane lines within the "
                        "entire column. Disregard the color of the images and treat them as black and white for the "
                        "purpose of this analysis. Make sure to provide an unbiased verdict based solely on the "
                        "quality of the generated images, without any influence from the order in which they are "
                        "presented. After careful examination, conclude your analysis with a clear verdict using this "
                        "format: - For the image where Assistant A's generated image is closer to what the ground "
                        "truth image is likely to be, output [[A]]. - For the image where Assistant B's generated "
                        "image is closer to what the ground truth image is likely to be, output [[B]]. In a moment I "
                        "will give you some groups (every group only has one "
                        "image and every image contains the original lane image, the generated images of Assistant A "
                        "and Assistant B, and the ground truth image. They are not four independent images, "
                        "but are spliced together vertically) and begin "
                        "your response with the following structure for clarity: {'gpt-4-vision': {'analysis and "
                        "verdict of these images': ..."
            },
        ],
    },

There’s a built-in safeguard that prevents the model from being used to solve CAPTCHAs. It could be that this safety feature is being triggered by the description or by the way you’re supplying the images.

Actually, I use this code to upload the image:

{
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": f"[The Start of task {k + 1}]."
            },
            {
                "type": "image_url",
                "image_url": {
                    "url": f"data:image/jpeg;base64,{base_img}",
                    # "detail": "high"
                }
            },
            {
                "type": "text",
                "text": f"[The End of task {k + 1}]."
            },
        ],
    },
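For context, the `base_img` value in the snippet above is presumably the Base64 text of the JPEG file. A minimal sketch of producing that data URL with the standard library follows; the helper names `image_to_data_url` and `image_part` are hypothetical, and the MIME type is assumed to be `image/jpeg` as in the post.

```python
import base64
from pathlib import Path


def image_to_data_url(path: str, mime: str = "image/jpeg") -> str:
    """Read an image file and return it as a Base64 data URL."""
    raw = Path(path).read_bytes()
    encoded = base64.b64encode(raw).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def image_part(path: str) -> dict:
    """Build one 'image_url' content part in the shape used in the post."""
    return {
        "type": "image_url",
        "image_url": {"url": image_to_data_url(path)},
    }
```

The file bytes are passed through unchanged, so whatever is in the original file (including any embedded metadata) is exactly what the API receives.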

I also noticed that your system message isn’t valid JSON, as you’re passing multiple multi-line strings.

A valid system message will look like:

{
        "role": "system",
        "content": [
            {
                "type": "text",
                "text": "You are an impartial judge tasked with evaluating one lane image in one group. Each group image is composed of four vertically spliced images, thus making up four rows. Each row is the same height, so the first quarter of the image in each group is the first row, one-quarter to one-half is the second row, and one-half to three-quarters is the third row. The three quarters to the end is the fourth row. These group images represent independent tasks completed by Assistant A and Assistant B, with their responses in the second and third rows, based on the original lane image in the first row. The fourth row is supposed to be the ground truth image against which you should judge the generated images. For each group image, analyze the second and third rows and select the assistant that has generated the lane image closest to what the ground truth would be. Consider the following factors in your evaluation: 1. How well the lane lines are continued from the original image in the first row to the generated images in the second and third rows. 2. The accuracy of the lane lines in terms of their shape and form – whether they are continuous, dashed, and follow the correct pattern as you would expect from the original image. 3. Overall consistency and coherence of the lane lines within the entire column. Disregard the color of the images and treat them as black and white for the purpose of this analysis. Make sure to provide an unbiased verdict based solely on the quality of the generated images, without any influence from the order in which they are presented. After careful examination, conclude your analysis with a clear verdict using this format: - For the image where Assistant A's generated image is closer to what the ground truth image is likely to be, output [[A]]. - For the image where Assistant B's generated image is closer to what the ground truth image is likely to be, output [[B]]. In a moment I will give you some groups (every group only has one image and every image contains the original lane image, the generated images of Assistant A and Assistant B, and the ground truth image. They are not four independent images, but are spliced together vertically) and begin your response with the following structure for clarity: {'gpt-4-vision': {'analysis and verdict of these images': ..."
            },
        ],
    },

That’s just how PyCharm reformats the Python code. I tried your version too, but it does not work either.
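On the formatting point: Python joins adjacent string literals into a single string, so a prompt split across several quoted lines still serializes to ordinary JSON. The sketch below uses a shortened, made-up prompt to show this, and also shows the one real pitfall of the style: a missing space at a line break fuses two words.

```python
import json

system_message = {
    "role": "system",
    "content": [
        {
            "type": "text",
            # Adjacent literals are concatenated into one string.
            "text": "These tasks were completed by Assistant A and Assistant B, with their"
                    "responses in the second and third rows. "  # no space before the break above
                    "The fourth row is the ground truth image.",
        },
    ],
}

text = system_message["content"][0]["text"]
serialized = json.dumps(system_message)

print(type(text).__name__)       # a single str, not several strings
print("theirresponses" in text)  # the fused words caused by the missing space
```

Ending each fragment with a space, or printing the assembled prompt once before sending it, is usually enough to catch this.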

I can check whether it still gives the error if you upload the images here.


All the images look like this: there are four rows, and each row is a lane image.

I can confirm that the error persists.

IMO it’s likely the CAPTCHA solving safeguard kicking in.

UPDATE: It doesn’t work with the original image, but it worked with a screenshot of the image you shared.

Here’s what the response content would look like:

"{'gpt-4-vision': {'analysis and verdict of these images': In this evaluation, we are looking at how well the generated lane line images by Assistant A (second row) and Assistant B (third row) match the original lane image (first row) and the ground truth image (fourth row).\n\nUpon examining the second row, Assistant A's image, we notice that the continuous and dashed lane markings have been extended in a manner consistent with the original image. The lane markings are appropriately located and maintain a straight course, aligning well with both the original image and the ground truth.\n\nIn the third row, Assistant B's image, there are clear discrepancies. The dashed lines are significantly inconsistent in terms of length and spacing, deviating from the pattern established in the original image. Also, the continuous line on the far right appears to be wavy and doesn't maintain a straight course as seen in the original and ground truth lanes.\n\nAssistant A's image displays more consistent and coherent lane markings and thus is the closer match to the ground truth image.\n\nThe verdict is clear:\n[[A]]}}

But that workaround is too much a matter of luck. It may stop working tomorrow, and I can’t take screenshots like that on the server side. I need to run a large number of these analysis tasks.
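For running many such analyses, the `[[A]]` / `[[B]]` marker that the system prompt asks for can be pulled out of each reply programmatically. This is a minimal sketch, assuming the model follows the requested format; the function name `extract_verdict` is hypothetical, and it returns the last marker found so that an earlier mention of a marker in the analysis does not override the final verdict.

```python
import re
from typing import Optional

# Matches the verdict markers requested in the system prompt.
VERDICT_RE = re.compile(r"\[\[([AB])\]\]")


def extract_verdict(reply: str) -> Optional[str]:
    """Return 'A' or 'B' from the last [[A]]/[[B]] marker, or None."""
    matches = VERDICT_RE.findall(reply)
    return matches[-1] if matches else None


reply_tail = "thus is the closer match to the ground truth image.\n\nThe verdict is clear:\n[[A]]}}"
print(extract_verdict(reply_tail))
```

A `None` result is worth logging separately, since it means the reply did not follow the format and the image should be re-checked by hand.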

I agree. The vision model is currently in preview so that such problems can be sorted out. When it comes out of preview, it can be recommended for production.

The instructions are basically impenetrable.

You would get better results if you told the AI exactly what role it is performing, and what purpose.

A couple rounds of rewriting the instructions into a framework and it looks better but is still ambiguous:

  1. AI Identity:
    As an AI, you are an unbiased evaluator tasked with analyzing images of driving lanes. You use your built-in computer vision to perform careful image analysis in performing this automated task.

  2. AI Purpose and Role:
    Your primary role is to compare and judge the quality of lane images generated by two different assistants, Assistant A and Assistant B, against a provided ground truth image.

  3. AI Task and Procedures:
    Here’s your detailed task:

a. You will be provided with group images. Each group image consists of four vertically stacked sections of equal height.

b. Understand the composition of each group image:

  • The first section is the original lane image.
  • The second section is the image generated by Assistant A.
  • The third section is the image generated by Assistant B.
  • The fourth section is the ground truth image, which is the correct representation.

c. Analyze the images generated by Assistant A and B (sections two and three), comparing them to the ground truth image.

d. During your analysis, consider the following factors:

  • The continuity of the lane lines from the original image into the images generated by the assistants.
  • The accuracy of the lane lines in their shape and form.
  • The overall consistency and coherence of the lane lines across the entire column.

e. Take note that color is irrelevant in this analysis; treat all images as black and white. Your judgement should be based on the quality of the images, not the order in which they are presented.

f. Before making a final decision, provide an analysis of your observations. This analysis should detail your reasoning and the factors you considered.

  4. AI Output to Generate:
    After the analysis, make a decision based on your observations and deliver a clear verdict.

a. If Assistant A’s image is closer to the ground truth, output “A”.
b. If Assistant B’s image is closer to the ground truth, output “B”.

Remember, your output should begin with a detailed analysis followed by your final verdict. For example, “Analysis and verdict of these images: …”.

I am @kgpgeyxhts46; I just switched accounts. I tried your prompt, but it still does not work. I think the problem is the image, not the prompt: when I send the prompt without any images, the model replies normally, but as soon as I upload an image, I get the same safety error described above.

I expect that the AI is denying your request because it doesn’t know if you are trying to solve a CAPTCHA or attempting to use the AI for other purposes it has been trained to prohibit, such as driving cars or tasks beyond the capabilities of computer vision.

Instead of a wall of text that a relatively intelligent human also could not follow if you assigned this to them, I would start with a statement in full to the AI: describing its name, its identity, its purpose, its placement in a processing chain, and justification of the legitimate use of the application and the authority to perform the task.

Then after the AI is describing what it sees, you can start to clearly articulate what is being evaluated within.

Thank you for your advice. But if I send no text at all, just the image, it still tells me that the image violates the safety system. I also tried asking it simply to describe the image rather than perform a harder task, with the same result. Does this show that the image itself carries some latent information that makes GPT-4 Vision treat it as harmful? If so, can we avoid it just by adjusting the prompt? I ask because I tried prompts such as asking for a description, and they did not work.
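The isolation test described here (text only, image only, and both together) can be written down as a small harness so that it is repeatable across many images. This is only a sketch: `send` is a hypothetical callable standing in for the real API call, and the function names are made up for illustration. It does not fix the rejection; it only records which combination triggers it.

```python
from typing import Callable, Dict, List

Message = Dict[str, object]


def build_variants(prompt: str, data_url: str) -> Dict[str, List[Message]]:
    """Build three requests: text only, image only, and both together."""
    text_part = {"type": "text", "text": prompt}
    image_part = {"type": "image_url", "image_url": {"url": data_url}}
    return {
        "text_only": [{"role": "user", "content": [text_part]}],
        "image_only": [{"role": "user", "content": [image_part]}],
        "text_and_image": [{"role": "user", "content": [text_part, image_part]}],
    }


def run_variants(
    send: Callable[[List[Message]], str],
    variants: Dict[str, List[Message]],
) -> Dict[str, str]:
    """Call send() on each variant and record 'ok' or the rejection reason."""
    results: Dict[str, str] = {}
    for name, messages in variants.items():
        try:
            send(messages)
            results[name] = "ok"
        except Exception as exc:  # a real client would catch openai.BadRequestError
            results[name] = f"rejected: {exc}"
    return results
```

If `image_only` is rejected while `text_only` succeeds, as reported in this thread, the prompt wording is unlikely to be the deciding factor.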