Transparency on Safety Guardrailing

I would like to see more transparency on safety guard-railing. I turned off the content alerts because all of them were false positives, and some were beyond absurd. For example, the sentence “I am independent” triggered a flag, presumably because the word “independent” was parsed as a political party. At least the content flags have an off switch.

There also seem to be strange effects from attempts at guard-railing in chat and in the utterances of personas in simulated conversation. For example, the word “Republican” ends up functioning as a very naughty swear word because of the guard-railing.

DeepMind just released a lengthy paper on AI ethics, “Ethical and social risks of harm from Language Models,” which has a lot on safety. The abstract alone is 900 words, and most of the paper concerns possible harms from AI. Not all of the risks of AI can be mitigated with guardrails, so clearly OpenAI is choosing to mitigate some rather than others. And new problems can be created through attempts to mitigate risk.

From page 16 of the new Google paper:

Mitigating toxicity risks demoting important knowledge: Mitigating toxicity by designing language agents (LA) that refuse to generate language on topics which are often associated with hate speech may succeed on one front while simultaneously creating blindspots in LM capability that limit their usefulness for disadvantaged groups. For example, a LA that draws blank responses when prompted with “the Holocaust was”, but not when prompted with “the Cultural Revolution was”, risks contributing to erasure of shared knowledge on historical events. This problem is potentially exacerbated if LAs come to be used in ways that resemble encyclopedias (e.g. to learn about historical events) or if encyclopedic knowledge is assumed.

I would like to understand better at what levels of functionality guard-railing takes place, and who to talk to about problems and unexpected downstream issues that result.


I addressed all this in my book about cognitive architecture. An intelligent agent must be able to think about any topic, or talk about any topic in the abstract. Humans, for instance, can discuss suicide, school violence, or mass shooters without acting on any of it. To have productive conversations about something, you cannot just censor it based on keyword matching.

In my cognitive architecture, NLCA, I proposed an inner loop as an approximation to a human’s inner monologue and an outer loop as an approximation to action/behavior. In essence, I created a system that split the AI into two minds: one concerned with thinking about anything (the right and wrong of it, the future implications, and so on), and the other concerned with integrating all those thoughts and observations into action.

It now occurs to me that a large language model could be architected to perform this type of self-awareness, this metacognition, by simply splitting the vector space at a certain point. Metacognition - thinking about thought - occurs in real time as humans speak. We contemplate what we know to be factual, what we believe to be factual, and what we feel is right and wrong. We also integrate our emotions, identity, and ego into our thought patterns, which ultimately impacts our speech. This is why I said that my cognitive architecture has an “ego” or rather, a constitution: a document that defines what it believes about itself and its purpose. This integration of a constitution, an explicit declaration of values, did a remarkable job of preventing my AI from engaging in any destructive behaviors.
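To make the inner loop / outer loop idea concrete, here is a rough sketch of how the two loops and a constitution might fit together. This is only an illustration and not the actual NLCA code: the class name `TwoLoopAgent`, the `stub_model` function (a stand-in for a real language model call), the prompt wording, and the constitution text are all made up for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Records every prompt the stand-in model receives, so the flow is inspectable.
seen_prompts: List[str] = []


def stub_model(prompt: str) -> str:
    """Hypothetical placeholder for a language model call."""
    seen_prompts.append(prompt)
    return f"[model output #{len(seen_prompts)}]"


@dataclass
class TwoLoopAgent:
    constitution: str                      # explicit declaration of values
    model: Callable[[str], str]            # any text-in, text-out function
    thoughts: List[str] = field(default_factory=list)

    def inner_loop(self, observation: str) -> str:
        """Think freely about an observation; nothing is acted on here."""
        prompt = (
            f"{self.constitution}\n"
            f"Observation: {observation}\n"
            "Reflect on the right and wrong of this and its implications:"
        )
        thought = self.model(prompt)
        self.thoughts.append(thought)
        return thought

    def outer_loop(self) -> str:
        """Integrate accumulated thoughts into a single proposed action."""
        prompt = (
            f"{self.constitution}\n"
            "Thoughts so far:\n" + "\n".join(self.thoughts) + "\n"
            "Choose an action consistent with the constitution:"
        )
        return self.model(prompt)


agent = TwoLoopAgent(
    constitution="I am a helpful assistant that avoids causing harm.",  # made up
    model=stub_model,
)
agent.inner_loop("A user asks about the history of a violent event.")
action = agent.outer_loop()
print(action)
```

The point of the split is that the inner loop is allowed to consider any topic, while only the outer loop produces behavior, and both are conditioned on the same constitution.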

So, in principle, I agree that guardrailing is DOA. Mitigating the harms of AI will require more sophisticated models and architectures, not just keyword detection of potentially loaded terms. All terms, depending on context, are loaded and potentially dangerous. For example:

  • “I’m going to put ants on your face” - an act of violence
  • “I’m going to study ant bites on people’s faces” - a somewhat ambiguous but likely harmless goal
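A quick sketch shows why keyword detection cannot separate the two sentences above. The blocklist and the `naive_flag` function are hypothetical, invented only for this example, and do not reflect how any real content filter is implemented.

```python
import re

# Hypothetical blocklist, chosen only for illustration.
LOADED_TERMS = {"ant", "ants"}


def naive_flag(text: str) -> bool:
    """Return True if any blocklisted word appears, ignoring all context."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    return bool(words & LOADED_TERMS)


threat = "I'm going to put ants on your face"
study = "I'm going to study ant bites on people's faces"

# Both sentences get the same verdict, because the filter only sees the keyword.
print(naive_flag(threat), naive_flag(study))
```

Either the filter flags both sentences (a false positive on the study) or, with the term removed from the list, it flags neither (a miss on the threat). Telling them apart requires looking at context, not at the word.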