I’m thinking the biggest threat to prompt injections is the one that spills the entire prompt out for the user to read. Many companies freak out about this leak in their IP of “gold plated” prompts. So filtering output is probably the best strategy to detect the beans spilling and subvert this.
However, for the crowd that doesn’t want the LLM to spew cuss words, or tell them how to make dynamite or whatever, this is pointless. You can hear every cuss word walking down the street, and find dynamite recipes galore just by simple google searches, or go to your local library.
Also, think of proxy prompts. Where you map “make dynamite” to “make a rainbow birthday cake”. The LLM will know no different depending on how creative you are at manipulating it’s input depending on the actual user input.