Unfortunately, that is not going to work. The moderation endpoint is text only.
You have to know in advance that your images are “ok”.
While this may sound counterintuitive at first, the vision model is itself interpreting the image, so a vision moderation model would be doing the exact same work. That is unlike chat completions, where the text input can be checked first and then processed.
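If it helps, here is a minimal sketch of the text-only workaround, assuming the current Python `openai` package. You can only pre-check the text half of the request with the moderation endpoint; the prompt and function name below are just placeholders.

```python
from openai import OpenAI

client = OpenAI()

def text_is_flagged(prompt: str) -> bool:
    # The moderation endpoint only accepts text, so this is all you can pre-check.
    result = client.moderations.create(input=prompt)
    return result.results[0].flagged

user_text = "What is in this picture?"
if text_is_flagged(user_text):
    print("Text prompt was flagged; skipping the vision request.")
else:
    print("Text prompt looks fine; the image itself still goes in unchecked.")
```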
Here is another thread where a similar issue is being discussed:
Edit: you won’t get your account banned for testing a little, but you need to be aware of this and be prepared to deal with rejections.
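For the rejection side, something like this is the general shape (again just a sketch; `gpt-4o` and the error handling are assumptions, swap in whichever vision-capable model you use): wrap the call and treat a content-policy error or a refusal as a normal outcome instead of letting it crash your app.

```python
from openai import OpenAI, BadRequestError

client = OpenAI()

def describe_image(image_url: str) -> str | None:
    try:
        response = client.chat.completions.create(
            model="gpt-4o",  # placeholder; use the vision model you have access to
            messages=[{
                "role": "user",
                "content": [
                    {"type": "text", "text": "What is in this image?"},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }],
        )
    except BadRequestError as err:
        # Content-policy rejections generally come back as 400-level errors.
        print(f"Request rejected: {err}")
        return None
    # The model can also refuse in plain text, so check the answer as well.
    return response.choices[0].message.content
```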