Multimodal / Vision Safety Alignment

Is there any documentation on the internal safety guardrails built into the multimodal models?

I am aware of the OpenAI Content Moderation API and that it supports images, but I am wondering whether the multimodal models themselves received safety-related alignment post-training to reject unsafe images.
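For context, here is roughly how I check an image with the moderation endpoint today (a minimal sketch, assuming the omni-moderation-latest model and the current Python SDK; the URL and caption are placeholders):

```python
from openai import OpenAI

client = OpenAI()

# Sketch: classify an image (plus optional caption text) with the
# moderation endpoint. Assumes omni-moderation-latest, which accepts
# image inputs alongside text.
response = client.moderations.create(
    model="omni-moderation-latest",
    input=[
        {"type": "text", "text": "Caption accompanying the image"},
        {
            "type": "image_url",
            "image_url": {"url": "https://example.com/image.png"},
        },
    ],
)

result = response.results[0]
print("flagged:", result.flagged)
# Per-category booleans, e.g. violence, sexual, self-harm
print(result.categories)
```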

Welcome @jack.k
AFAIK, there aren't any docs on model guardrails. There is, however, information on which kinds of tasks the models are not suited for when using vision.


In ChatGPT, you aren't directly chatting with the image-creating multimodal model. I would guess that a very large part of bringing this to market was extensive tuning to get beyond the output quality demonstrated in May 2024, perhaps because the model at that time was not suitable for generalized chat.

Or it may be that gpt-4o works just fine for images while chatting but is a poor judge of safety, just as ChatGPT will happily request things that are then blocked, or refuse requests that a dedicated safety system with policy would allow.

Example: children juggling live grenades? ChatGPT doesn't like that idea on its face, yet the image will be produced.

The announcement is cleverly cagey about what that gpt-4o actually is: "We've built our most advanced image generator yet into (ChatGPT's) GPT-4o." The wording "built into" distinguishes it a bit from simply "is".

Then further:

> we've trained a reasoning LLM to work directly from human-written and interpretable safety specifications… this allows us to moderate both input text and output images against our policies.

Just as the API model gets shut down externally when it begins streaming copyright infringement, vision inspection of the output will also terminate generation on you.
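For API users, that cutoff shows up as an early stream termination. A minimal sketch of detecting it, assuming the current Python SDK and an illustrative prompt; the Chat Completions stream reports finish_reason "content_filter" when an output filter steps in:

```python
from openai import OpenAI

client = OpenAI()

# Sketch: stream a chat completion and watch for an external cutoff.
# When server-side output moderation intervenes, the stream ends with
# finish_reason "content_filter" instead of "stop".
stream = client.chat.completions.create(
    model="gpt-4o",  # illustrative model choice
    messages=[{"role": "user", "content": "Describe the scene in detail."}],
    stream=True,
)

for chunk in stream:
    if not chunk.choices:  # some chunks (e.g. usage) carry no choices
        continue
    choice = chunk.choices[0]
    if choice.delta and choice.delta.content:
        print(choice.delta.content, end="", flush=True)
    if choice.finish_reason == "content_filter":
        print("\n[generation terminated by the provider's output filter]")
```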