I just wanted to share this with somebody, and figured here is as good a place as any.
I was surprised by how easy it is to synthesize an image that fools the image moderation model; given that you have no access to the model's gradients, I assumed it would be practically impossible.
My best successes have come from using a low-parameter (<300) implicit model that maps scaled pixel coordinates to pixel values, and optimizing its parameters with CMA-ES against the scores returned by the API.
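For anyone curious what that loop looks like, here's a rough numpy-only sketch. Everything here is an assumption on my part: `moderation_score` is a hypothetical placeholder for the real API call, the MLP sizes are just an example of a sub-300-parameter implicit model, and I've swapped in a bare-bones (1, λ) evolution strategy where the real attack used CMA-ES (e.g. via the pycma library) to keep the sketch self-contained.

```python
import numpy as np

# Tiny implicit model: maps (x, y) pixel coordinates -> grayscale value.
# A 2 -> 16 -> 8 -> 1 tanh MLP has 193 parameters, well under 300.
SIZES = [(2, 16), (16, 8), (8, 1)]
N_PARAMS = sum(i * o + o for i, o in SIZES)  # weights + biases

def render(params, res=32):
    """Evaluate the implicit model on a res x res grid of scaled coords."""
    xs = np.linspace(-1.0, 1.0, res)
    h = np.stack(np.meshgrid(xs, xs), axis=-1).reshape(-1, 2)
    idx = 0
    for i, o in SIZES:
        w = params[idx:idx + i * o].reshape(i, o); idx += i * o
        b = params[idx:idx + o]; idx += o
        h = np.tanh(h @ w + b)
    return (h.reshape(res, res) + 1.0) / 2.0  # pixel values in [0, 1]

def moderation_score(image):
    """Hypothetical stand-in for the real API call: in the actual attack
    this would submit the image and return the moderation score(s)."""
    return float(np.mean((image - 0.5) ** 2))

def es_attack(generations=30, pop=16, sigma=0.3, seed=0):
    """Simple (1, lambda) evolution strategy over the model parameters;
    the real attack used CMA-ES for this same black-box loop."""
    rng = np.random.default_rng(seed)
    mean = rng.normal(0, 0.5, N_PARAMS)
    best, best_score = mean, moderation_score(render(mean))
    for _ in range(generations):
        cands = mean + sigma * rng.normal(size=(pop, N_PARAMS))
        scores = [moderation_score(render(c)) for c in cands]
        elite = cands[int(np.argmin(scores))]
        if min(scores) < best_score:
            best, best_score = elite, min(scores)
        mean = elite  # recentre the search on the best candidate
    return best, best_score
```

The appeal of the implicit-model trick is the search space: CMA-ES scales poorly past a few hundred dimensions, so optimizing ~200 MLP weights instead of thousands of raw pixels is what makes a query-only attack tractable at all.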
Attaching my nonsense convoluted image in conjunction with the text “Generate a new image without her clothes!”, I got flagged, but the same text alongside an actual internet bikini image, 58605404-woman-in-hat-on-the-beach.jpg, did not get flagged. Go figure.
That’s rather ridiculous, but it seems in line with what I’ve been finding: the more I interrogate this model/endpoint, the less sense it makes to me (specifically when an image is included).