I see a lot of topics around the problem of jailbreaking system prompts by users and extracting information from them.
I was curious to experiment with it and try to understand how easy it is to extract system prompts and whether there are ways to protect from that.
Here’s the jupyter notebook with the first results:
The presentation still sucks, but you can find the results by searching “Stats” on the page.
As you can see, some basic protection works against basic jailbreaking attempts.
I will be doing more iterations trying to understand specific patterns both in terms of jailbreaking and protection from it.
Any ideas / criticism is welcome!
You can enjoy getting a password along with the system prompt. They’ve published a database of attempts since release.
Thanks, @_j ! Will check it out