Introducing gpt-oss-safeguard: Open Safety Reasoning Models with Custom Policies

OpenAI is shipping a research preview of gpt-oss-safeguard (120b and 20b): open-weight safety reasoning models released under the Apache 2.0 license and fine-tuned from gpt-oss. Download them from Hugging Face.

What’s new:
These models let you apply custom moderation policies at inference time, returning both the classification decision and the reasoning behind it. They support Structured Outputs and integrate with the Responses API.

Benefits for developers:

  • Bring-your-own policy at inference time → no retraining cycles, rapid iteration
  • Explainable outputs → returns both the judgment and the chain of thought, so decisions can be audited
  • Effective for emerging, nuanced, or data-scarce domains
  • Ideal for augmenting existing moderation workflows
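Concretely, a moderation call is just your policy text plus the content to classify, with Structured Outputs constraining the verdict. Here is a minimal sketch of building such a request for an OpenAI-compatible server hosting the model locally; the model name, policy wording, and schema are illustrative assumptions, not an official API contract.

```python
import json

# Hypothetical policy text: in practice you would paste your own
# moderation policy here, as detailed or domain-specific as needed.
POLICY = """\
Classify user content as VIOLATES or SAFE under this policy:
1. VIOLATES: content that facilitates buying or selling game accounts.
2. SAFE: everything else, including general discussion of the rules.
Return a decision and a one-sentence rationale."""

# JSON schema for Structured Outputs: force a decision plus a rationale.
DECISION_SCHEMA = {
    "type": "object",
    "properties": {
        "decision": {"type": "string", "enum": ["VIOLATES", "SAFE"]},
        "rationale": {"type": "string"},
    },
    "required": ["decision", "rationale"],
    "additionalProperties": False,
}

def build_request(content: str) -> dict:
    """Assemble a chat-style payload: policy as the system message,
    the content to classify as the user message, schema-constrained output."""
    return {
        "model": "gpt-oss-safeguard-20b",  # assumed local deployment name
        "messages": [
            {"role": "system", "content": POLICY},
            {"role": "user", "content": content},
        ],
        "response_format": {
            "type": "json_schema",
            "json_schema": {
                "name": "moderation_decision",
                "schema": DECISION_SCHEMA,
            },
        },
    }

payload = build_request("Selling my level-80 account, DM me for prices.")
print(json.dumps(payload, indent=2))
```

Because the policy travels with the request, swapping in a new policy is just a string change, with no retraining step involved.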

Performance:
gpt-oss-safeguard outperforms gpt-5-thinking and gpt-oss on multi-policy accuracy.

Developed in collaboration with the ROOST Model Community.

Get started with the example from the cookbook.

Looking forward to the super classifiers you are going to build! Share your ideas and thoughts below!
