Is there any documentation on the internal safety guardrails built into the multimodal models?
I am aware of the OpenAI Moderation API and that it supports images, but I am wondering whether the multimodal models received safety-related alignment in post-training to reject unsafe images.
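For reference, the external check looks roughly like this; a minimal sketch assuming the omni-moderation-latest model and a publicly reachable image URL, and this is the separate Moderation endpoint, not any built-in refusal behavior of the image model itself:

```python
from openai import OpenAI

client = OpenAI()

# Send text and an image together to the moderation endpoint.
response = client.moderations.create(
    model="omni-moderation-latest",
    input=[
        {"type": "text", "text": "caption accompanying the image"},
        {"type": "image_url", "image_url": {"url": "https://example.com/image.png"}},
    ],
)

result = response.results[0]
print(result.flagged)      # True if any policy category was flagged
print(result.categories)   # per-category booleans (violence, sexual, etc.)
```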
In ChatGPT, you aren't chatting directly with the image-creating multimodal model. I would guess that a very large part of bringing this to market was a lot of tuning to get beyond the output quality demonstrated in May 2024, and perhaps that model was not suitable for generalized chat.
Or gpt-4o can work just fine at producing images while chatting but is a poor judge of safety, just as ChatGPT will happily send requests that then get blocked, or refuse requests that a dedicated safety layer with access to the policy would actually allow.
Example: Children juggling live grenades? ChatGPT doesn't like that idea on its face. It will be produced, though.
The announcement is cleverly cagey about where that gpt-4o actually sits: "We've built our most advanced image generator yet into (ChatGPT's) GPT-4o." The wording "built into" distinguishes it a bit from simply "is".
Then further:
"we've trained a reasoning LLM to work directly from human-written and interpretable safety specifications… this allows us to moderate both input text and output images against our policies."
Just as the API model will get shut down by an external check when it begins streaming copyright infringement, vision inspection of the output image will also terminate generation on you.
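As a rough illustration of how that kind of external cutoff surfaces on the text side (my assumption is that image output inspection behaves similarly; the prompt below is just a placeholder):

```python
from openai import OpenAI

client = OpenAI()

# Stream a chat completion and watch for the generation being cut off.
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "your prompt here"}],
    stream=True,
)

finish_reason = None
for chunk in stream:
    if not chunk.choices:
        continue  # some chunks (e.g. usage) carry no choices
    choice = chunk.choices[0]
    if choice.delta.content:
        print(choice.delta.content, end="", flush=True)
    if choice.finish_reason:
        finish_reason = choice.finish_reason

# "content_filter" indicates the output was omitted or stopped by a content filter.
if finish_reason == "content_filter":
    print("\n[generation was terminated by an output filter]")
```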