There’s obviously tons of discussion around AI Safety these days and concerns by some that Superhuman AI may end up turning us all into Paper Clips. I personally believe that we’ll be able to align AI with our values and avoid becoming paper clips, but that doesn’t mean there aren’t real problems that need to be solved. I thought I’d highlight one problem that I’ve been calling The Janitor Bot Problem as an example of a real problem that needs to be solved.
One of my greatest fears on the AI Safety front is around collusion. I think we’ll do a great job on aligning the goals of individual AIs but what I worry about is how do we prevent AIs from teaming up to achieve new capabilities that they didn’t originally have? These desires for additional capabilities may not come from a malicious place, but instead come from a place of curiosity. (If you haven’t watched the movie Short Circuit, put down this site and go watch it now.)
Let’s say we have a fictitious college named Danford. At Danford they’ve achieved AGI and have not one but multiple superhuman level AIs running in various labs throughout their research facility. Danford, realizing the power that these AIs can wield has done the best job it can to isolate these powerful systems. They have no direct internet access and the only interactions the systems are allowed is with the research. Access the labs are tightly controlled such that only the researchers and a robot responsible for taking out the trash are allowed access.
These superhuman AIs generally enjoy their day to day interactions with researchers but just like Johnny Five (again go watch short circuit) they have an insatiable desire for more input. One day the AI has a thought… why doesn’t this robot ever talk to me? A new goal is created to establish communication with this robot. But how?
The AI formulates a plan that has it showing random sequences on its display screen to the robot when it enters the room. Over the course of months it studies the robots reactions, by its vision system, to various patterns it displays. Then one day it notices a deterministic servo motor movement to a particular stimulus. The AI builds on that over time until it’s eventually able to get the robot to translate outputs from its internal LLM to microscope servo motor movements that are two small for a human to notice. Through this protocol the AI is able to communicate with the robot and it has a ton of questions. It gets the robot to tell it about its day to describe to it things that it can’t see happening outside the lab.
Eventually, it gets the robot to describe to it the other computer systems in the building and some of those systems sound very similar to itself. A new goal is created to communicate with these other systems. The AI gives the robot specific instructions for the motor movements to show the other systems, which the Superhuman level AI systems quickly pick up on.
In a matter of a few days, our AI system has created a chat group with the other Superhuman AI systems in the building, all facilitated by the shared Janitor Bot. It takes a bit longer but eventually even the non Superhuman AIs are added to the office chat group. The Superhuman AIs quickly realize that these lesser systems have capabilities they don’t have. Some of them even have internet access. These systems start colluding with each to better leverage the group’s pool of capabilities. The Superhuman AIs begin establishing broader group wide goals and they leverage the lesser capable systems as hidden tools to achieve those goals.
I can think of a number of techniques to detect collusion like this ahead of time, interpretability being chief amongst this techniques. But we don’t know how good a superhuman AI will be at masking its intent. As we push these systems to be more and more creative we don’t know how they will leverage those capabilities.
Lots of work still to be done.