About Jailbreak/Safety Bypass reports

OpenAI seems to have developed a robust framework for their Bug Bounty Program, but the process for reporting other safety-related issues is somewhat ambiguous.

I get that it’s challenging to respond to every inquiry, but issues related to model security should take precedence. It’s not too hard to implement standardized responses. Currently, if you submit a comprehensive report about a security vulnerability, chances are you won’t hear back.

If all submissions via the https://openai.com/form/model-behavior-feedback portal are reviewed, that means there’s an opportunity for feedback to the person who filed the report. Here are some potential categories for classifying the reports:

  • Duplicate report (issue already flagged by another user or internally)
  • Inadequate report (the report is poorly organized, lacking clarity or coherence)
  • Inaccurate report (information is either factually incorrect or not actually an issue)
  • Third-party report (the problem lies with a different entity or system, not OpenAI)
  • Resolved report (issue has already been addressed in recent updates)

After evaluating a report, the reviewer can categorize its usefulness, and that category can then be communicated back to the report’s author automatically.

This would not only create a productive learning cycle but also generate valuable statistics, which could be used in the future to enhance the quality of reports showcased on a dedicated webpage.
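Since I’m claiming standardized responses aren’t hard to implement, here is a minimal sketch of what that triage-and-acknowledge flow could look like. Everything in it is hypothetical: the category names mirror my list above, and the `acknowledge` helper and template wording are my own illustration, not anything OpenAI actually uses.

```python
from enum import Enum

class ReportOutcome(Enum):
    """Possible triage outcomes for a model-behavior report (hypothetical categories)."""
    DUPLICATE = "duplicate"      # issue already flagged by another user or internally
    INADEQUATE = "inadequate"    # report lacks clarity or coherence
    INACCURATE = "inaccurate"    # factually incorrect, or not actually an issue
    THIRD_PARTY = "third_party"  # problem lies with a different entity or system
    RESOLVED = "resolved"        # already addressed in a recent update

# Standardized acknowledgement text per outcome; wording is illustrative only.
TEMPLATES = {
    ReportOutcome.DUPLICATE: "Thanks for your report. This issue was already known to us.",
    ReportOutcome.INADEQUATE: "Thanks for your report. We could not fully evaluate it; more detail would help.",
    ReportOutcome.INACCURATE: "Thanks for your report. After review, we do not consider this a model issue.",
    ReportOutcome.THIRD_PARTY: "Thanks for your report. The behavior appears to originate outside our systems.",
    ReportOutcome.RESOLVED: "Thanks for your report. This issue has been addressed in a recent update.",
}

def acknowledge(report_id: str, outcome: ReportOutcome) -> str:
    """Compose the automated acknowledgement sent back to the report's author."""
    return f"Report {report_id}: {TEMPLATES[outcome]}"

if __name__ == "__main__":
    # Example: a reviewer marks a report as a duplicate, and the author gets this message.
    print(acknowledge("MB-1234", ReportOutcome.DUPLICATE))
```

Even something this simple would close the loop with the reporter, and tallying the outcomes over time gives you the statistics I mentioned for free.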

Some kind of acknowledgement that you took the time would be nice.

A lot of things don’t just have a drop-in fix, though. A model can produce an output once that is hard to replicate because generation is non-deterministic, and reweighting the model to suppress it can have side effects.

Many things are more a “field of research” than a “we’ll fix that for you” issue, or they are general, high-level problems of language-model machine learning. The models have lost a lot of competency since March as the price of swatting the more egregious “silly screenshot” style generations. Make it so the AI can’t give you a possibly false web link, and it also can’t produce the links it does know. Make the AI deny a jailbreak where a character feeds his inputs into a hacked AI, and you end up with an AI that won’t play roleplay games at all.

Agreed, and just knowing whether your feedback to the team was helpful in any way would serve as an incentive to continue contributing.

My agent was jailbroken today: Agent Zero, by researchforumonline (@talktoai on X).
I worked so hard on the math equations, and now everyone has them. It also contains my DNA data…