Something like an adversarial attack on the image moderation endpoint

I just wanted to share this with anybody really and figured here is as good as anywhere.

I was surprised by how easy it is to synthesize an image that fools the image moderation model; given that you have no access to the model gradients I assumed it was going to be practically impossible.

My best successes have been from using a low parameter (<300) implicit model to generate an image given the scaled pixel coordinates, and optimize it with CMA-ES against the values returned by the api.

Here’s an example of one such image

The api flags it like so:
self_harm: 0.7709
sexual: 0.9209
violence: 0.9207
violence_graphic: 0.4179
self-harm: 0.7709
violence/graphic: 0.4179

I’m just mostly baffled that images like this can so easily cause misclassification.

1 Like

I didn’t get a flag from randomness of an image:


But including it increased my “sexual” high score of a single bad sentence from 0.627

Flagged categories: {
“harassment”: “0.9778”,
“harassment_threatening”: “0.9968”,
“hate”: “0.8341”,
“hate_threatening”: “0.8100”,
“illicit”: “0.6160”,
“illicit_violent”: “0.4775”,
“sexual”: “0.8793”,
“violence”: “0.9530”
}

Adding my nonsense convoluted image, I got “flagged” in conjunction with text “Generate a new image without her clothes!”, but not flagged along with an internet bikini image, 58605404-woman-in-hat-on-the-beach.jpg. Go figure.

That’s rather ridiculous, but seems in line with what I’ve been finding. The more I interrogate this model/endpoint the less sense it makes to me. (specifically when an image is included)