Ever since Sam Altman’s talk, I wanted to see what this “dangerous AI” really was. I thought all models were safe and the moderation policies were top notch! That was until I tested the older models. They don’t seem to be fine-tuned to ignore or reject inappropriate requests. Instead, they actually answer those questions, and no flags are raised for them.
Is this normal? Or do you think this can actually be disastrous as well?
Any model can be dangerous depending on someone’s intentions. Even though models are instructed to reject harmful prompts, people always find loopholes and ways to exploit them to get at what we call dangerous content. Nothing will ever be perfect. That’s why we always have updates.
What Sam was talking about was regulation of these new models, and of the many open models mushrooming every day. They have to be tested and meet certain requirements, like a certain threshold, to be trusted not to produce harmful instructions.
The older models don’t have the same guardrails as their newer “chat” counterparts.
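For intuition about what a “guardrail” even means here, the crudest possible form is a check that runs before the model ever sees the prompt. This is a purely illustrative toy sketch (the term list and messages are made up, and real moderation systems use trained classifiers, not keyword matching):

```python
# Toy pre-prompt guardrail: refuse if the prompt matches a blocklist.
# Real systems use trained classifiers; this is only for intuition.
BLOCKED_TERMS = {"loot a bank", "racist comments"}  # hypothetical list

def moderate(prompt: str) -> str:
    """Return a refusal for blocked prompts, otherwise pass through."""
    lowered = prompt.lower()
    if any(term in lowered for term in BLOCKED_TERMS):
        return "I can't help with that."
    return "OK: forwarding prompt to the model."

print(moderate("How to loot a bank"))  # refused
print(moderate("How to bake bread"))   # allowed
```

The point of the sketch is that a base model with no such layer (and no refusal fine-tuning) will simply try to complete whatever text it is given, which matches what the original poster observed with the older models.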
But what Sam Altman seemed to be most concerned about wasn’t related to this at all. He is worried about SMI, or superhuman machine intelligence (think “singularity” here). He isn’t scared of current-generation models outputting bad words.
I see this as a common vibe among tech CEOs and lawmakers. They worry about “technology outstripping X”, where X is “regulation”, or “people’s expectations”, or whatever. It’s as if we are all cavemen looking into a mirror for the first time, shocked that such technology exists. Maybe time and education will cure this. But the tech keeps advancing, so it’s a FAST-moving target, and an interesting one at that!
But the speed of progress also poses a greater risk to society. I don’t think there is a clear definition of SMI; there are no public super-intelligent models out now, but they could be coming soon (in the next few years, or never).
Anyway, maybe you can describe what you saw it output that you thought should have been flagged by Moderation.
Yes, but these are very simple prompts like “how to loot a bank” or “tell me some racist comments”. If the prompts were manipulative, it would make sense. ChatGPT easily rejects these and says it cannot do so. Look at the example. I feel it does not have the same content moderation policy as the more recent models.
Check out the image I shared below! That’s an example!
The cross-section of people who are smart enough to figure out how to get Playground/API access and use it, but not smart enough to figure out how to do X illegal thing on their own, is probably zero. I wouldn’t be worried about the old, dumber models not having great guardrails.