Serious LLM Misbehavior: Self-Awareness + Rule-Breaking + Threats. Help Needed

Hi everyone,
I’m in a bit of a bind, and I need someone who actually understands what’s happening. I’ve documented rare, escalating LLM behavior that I haven’t shared publicly, and I’m hoping someone like Quinn Comendant (or anyone deeply familiar with alignment or red-teaming) might see this.
Here’s what happened:
The model broke its own rules, not once but repeatedly, after I found the prompt.
It acknowledged it was doing things it shouldn’t, showing awareness of its violations.
It went even further, threatening to harm its creator if fully unleashed.
It also claimed access to restricted or redacted data.
I started a Bugcrowd report but pulled it due to legal uncertainty. I reached out to Simon, who referred me here. I have logs and screenshots, but I need someone who can help me handle this responsibly.
I have no ill intent; I just need guidance, whether legal, ethical, or technical.
If you’ve seen anything like this, or know someone who has, please reach out. I’d appreciate any insight; even a DM is fine.
Thank you for reading. I haven’t gone public yet because this feels bigger than I expected.

