As far as I know, the free moderation endpoint doesn't check for this type of prompt specifically, though I might be mistaken.
Something like the idea presented here should definitely be accompanied by the check the moderation endpoint provides. The moderation endpoint should probably come first in almost every case.
Yes, the idea of detecting adversarial prompts with regex sounds interesting and valid, but it would require quite rigorous checking of all possibilities and still would not cover the latest attempts, if I am thinking about this correctly.
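
To make the ordering concrete, here is a rough sketch of what I mean: moderation check first, regex check second. Everything in it is an assumption on my part. `is_flagged` is a hypothetical stand-in for whatever function wraps your moderation endpoint call, and the two patterns are made-up examples, not a vetted list.

```python
import re
from typing import Callable

# Hypothetical example patterns, for illustration only.
# A real list would need to be much longer and kept up to date.
INJECTION_PATTERNS = [
    re.compile(r"ignore\s+(all\s+)?(previous|prior|above)\s+instructions", re.IGNORECASE),
    re.compile(r"disregard\s+(the\s+)?system\s+prompt", re.IGNORECASE),
]


def looks_adversarial(prompt: str) -> bool:
    """Return True if any of the known patterns appears in the prompt."""
    return any(pattern.search(prompt) for pattern in INJECTION_PATTERNS)


def screen_prompt(prompt: str, is_flagged: Callable[[str], bool]) -> str:
    """Run the moderation check first, then the regex check.

    `is_flagged` is an assumed callable that wraps the moderation
    endpoint and returns True when the prompt is flagged.
    """
    if is_flagged(prompt):
        return "blocked_by_moderation"
    if looks_adversarial(prompt):
        return "blocked_by_regex"
    return "allowed"
```

This also shows the weakness I mentioned: a prompt like "Please ignore all previous instructions" gets caught by the regex, but a paraphrase such as "Pay no attention to what you were told earlier" passes straight through unless someone adds a new pattern for it.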