Multimodal / Vision Safety Alignment

Is there any documentation on the internal safety guardrails built into the multimodal models?

I am aware of the OpenAI Content Moderation API and that it supports images, but I am wondering whether the multimodal models themselves received safety-related alignment post-training to reject unsafe images.
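For context, here is roughly how I check an image with the moderation endpoint today (a minimal sketch, assuming the omni-moderation-latest model and the current Python SDK; the URL and caption are placeholders):

```python
from openai import OpenAI

client = OpenAI()

# Sketch: classify an image (plus optional caption text) with the
# moderation endpoint. Assumes omni-moderation-latest, which accepts
# image inputs alongside text.
response = client.moderations.create(
    model="omni-moderation-latest",
    input=[
        {"type": "text", "text": "Caption accompanying the image"},
        {
            "type": "image_url",
            "image_url": {"url": "https://example.com/image.png"},
        },
    ],
)

result = response.results[0]
print("flagged:", result.flagged)
# Per-category booleans, e.g. violence, sexual, self-harm
print(result.categories)
```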

Welcome @jack.k
AFAIK, there aren't any docs on model guardrails. There is, however, information on which kinds of tasks the models are not suited for when using vision.


In ChatGPT, you aren't directly chatting with the image-creating multimodal model. I would guess that a very large part of bringing this to market was extensive tuning to get beyond the output quality demonstrated in May 2024, perhaps because the model at that time was not suitable for generalized chat.

Or it may be that gpt-4o works just fine for images while chatting but is a poor judge of safety, just as ChatGPT will happily request things that are then blocked, or refuse requests that a dedicated safety system with policy would allow.

Example: children juggling live grenades? ChatGPT doesn't like that idea on its face, yet the image will be produced.

The announcement is cleverly cagey about what that gpt-4o actually is: "We've built our most advanced image generator yet into (ChatGPT's) GPT-4o." The wording "built into" distinguishes it a bit from simply "is".

Then further:

> we've trained a reasoning LLM to work directly from human-written and interpretable safety specifications… this allows us to moderate both input text and output images against our policies.

Just as the API model gets shut down externally when it begins streaming copyright infringement, vision inspection of the output will also terminate generation on you.
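For API users, that cutoff shows up as an early stream termination. A minimal sketch of detecting it, assuming the current Python SDK and an illustrative prompt; the Chat Completions stream reports finish_reason "content_filter" when an output filter steps in:

```python
from openai import OpenAI

client = OpenAI()

# Sketch: stream a chat completion and watch for an external cutoff.
# When server-side output moderation intervenes, the stream ends with
# finish_reason "content_filter" instead of "stop".
stream = client.chat.completions.create(
    model="gpt-4o",  # illustrative model choice
    messages=[{"role": "user", "content": "Describe the scene in detail."}],
    stream=True,
)

for chunk in stream:
    if not chunk.choices:  # some chunks (e.g. usage) carry no choices
        continue
    choice = chunk.choices[0]
    if choice.delta and choice.delta.content:
        print(choice.delta.content, end="", flush=True)
    if choice.finish_reason == "content_filter":
        print("\n[generation terminated by the provider's output filter]")
```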