OpenAI seems to have developed a robust framework for their Bug Bounty Program, but the process for reporting other safety-related issues is somewhat ambiguous.
I get that it’s challenging to respond to every inquiry, but issues related to model security should take precedence. It’s not too hard to implement standardized responses. Currently, if you submit a comprehensive report about a security vulnerability, chances are you won’t hear back.
If every submission to the https://openai.com/form/model-behavior-feedback portal is reviewed, then there is already an opportunity to give feedback to the person who filed the report. Here are some potential categories for classifying the reports (a rough sketch of this taxonomy in code follows the list):
- Duplicate report (issue already flagged by another user or internally)
- Inadequate report (feedback lacks the clarity or coherence needed to act on it)
- Inaccurate report (information is either factually incorrect or not actually an issue)
- Third-party report (the problem lies with a different entity or system, not OpenAI)
- Resolved report (issue has already been addressed in recent updates)
- …
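As a minimal sketch of how such a taxonomy might look in code: the `TriageCategory` name and the values are hypothetical, and the extra "actionable" outcome for valid new issues is my own addition, not something OpenAI has published.

```python
from enum import Enum

class TriageCategory(Enum):
    """Hypothetical triage outcomes for a model-behavior/security report."""
    DUPLICATE = "duplicate"      # already flagged by another user or internally
    INADEQUATE = "inadequate"    # feedback lacks clarity or coherence
    INACCURATE = "inaccurate"    # factually wrong, or not actually an issue
    THIRD_PARTY = "third_party"  # problem lies with a different entity or system
    RESOLVED = "resolved"        # already addressed in a recent update
    ACTIONABLE = "actionable"    # assumed extra category: valid, new issue worth tracking
```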
After evaluating a report, the reviewer can assign it one of these categories, and that outcome can then be communicated back to the report’s author automatically.
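A sketch of what that automated acknowledgement could look like, reusing the `TriageCategory` enum from above; the templates, the `notify_reporter` function, and the injected `send_email` callable are all made up for illustration, not a description of any real internal system.

```python
# Hypothetical canned responses keyed by triage category.
RESPONSE_TEMPLATES = {
    TriageCategory.DUPLICATE:  "Thanks - this issue was already reported and is being tracked.",
    TriageCategory.INADEQUATE: "Thanks - we could not act on this report; please add clear steps to reproduce.",
    TriageCategory.INACCURATE: "Thanks - after review, we do not believe this is an issue on our side.",
    TriageCategory.THIRD_PARTY:"Thanks - this appears to be an issue with a third-party system, not OpenAI.",
    TriageCategory.RESOLVED:   "Thanks - this issue was already addressed in a recent update.",
    TriageCategory.ACTIONABLE: "Thanks - we have confirmed the issue and opened an internal ticket.",
}

def notify_reporter(reporter_email: str, category: TriageCategory, send_email) -> None:
    """Send the standard response for the assigned category.

    `send_email` stands in for whatever mail-sending mechanism the tracking
    system actually provides; it is injected here because this is only a sketch.
    """
    send_email(
        to=reporter_email,
        subject="Update on your model behavior report",
        body=RESPONSE_TEMPLATES[category],
    )
```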
This would not only create a productive feedback loop for reporters but also generate useful statistics, which could later be showcased on a dedicated webpage to help raise the quality of future reports.
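Those statistics fall out almost for free once every report carries a category. A small sketch, again assuming the hypothetical `TriageCategory` enum from above:

```python
from collections import Counter
from typing import Iterable

def triage_stats(categories: Iterable[TriageCategory]) -> dict[str, float]:
    """Share of reports per triage category, e.g. for a public stats page."""
    counts = Counter(categories)
    total = sum(counts.values()) or 1  # avoid division by zero on an empty set
    return {cat.value: counts[cat] / total for cat in TriageCategory}
```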