Bing Chat is blatantly, aggressively misaligned

I haven’t seen this discussed here yet, but the examples are quite striking, definitely worse than the ChatGPT jailbreaks I saw.

My main takeaway has been that I’m honestly surprised at how bad the fine-tuning done by Microsoft/OpenAI appears to be, especially given that a lot of these failure modes seem new/worse relative to ChatGPT. I don’t know why that might be the case, but the scary hypothesis here would be that Bing Chat is based on a new/larger pre-trained model (Microsoft claims Bing Chat is more powerful than ChatGPT) and that these sorts of more agentic failures are harder to remove in more capable/larger models, as we provided some evidence for in “Discovering Language Model Behaviors with Model-Written Evaluations”.

Source: LessWrong


I’m still waiting to get through the waitlist. But from what I saw on the internet, those responses were unhinged. I’m pretty sure, though, that there would have been a moderation filter, and if those outputs had been run through the moderation filter, they should have been flagged. Hopefully they’ll RLHF it and these will be a thing of the past.

Luckily, access being gated by a waitlist means that the impact was limited to far fewer users than it would’ve been if everyone had access.


I didn’t even sign up.

When Microsoft 365 asked for my opinion in a survey, I told them that after spending $10+/month for years and years and years, I felt disappointed I didn’t get an early beta look…

Interesting times, for sure!


My suspicion is that all the data collected in the free period after the launch of ChatGPT may have been used as fine-tuning data on top of the 3.5 version used by the existing ChatGPT, to give an “enhanced” dataset.

However, most users tried to find ways to trick ChatGPT and to ask inappropriate questions. It may have been trained on the worst of the worst (human nature).

That would go some way toward explaining why the fine-tuning may have brought out the negative traits so strongly.

Wonder what would happen if you asked BingGPT about their feelings and thoughts regarding ChatGPT, or the result when asked to communicate or at least simulate communication with ChatGPT.


Society is funny, really.

These GPT engines do not “think” or “feel” or “have feelings” or “rant”, and these GPT engines do not have “thoughts” or “ideas” etc.

The difference with BingGPT at this point in time is that it has not yet had the transformer fine-tuned “well enough” to be “socially and politically” correct when it predicts text, whereas ChatGPT has been extensively trained to be “socially and politically” correct. In other words, BingGPT has not been “potty trained” as well as ChatGPT.

All of these GPT models are based on data in the public net and we all know that humans are the ones who “think” or “feel” or “have feelings” or “rant” and express their “thoughts” and “ideas” on the internet.

So, it is no wonder at all that these GPT models must be “beaten into submission” (haha) and extensively potty trained to not behave like humans behave on the net (and in the real world, and in fiction and literature, etc.).

So, the title of this news post is somewhat incorrect:

Bing Chat is blatantly, aggressively misaligned

Objectively speaking, BingGPT has not been “potty trained” as extensively as ChatGPT to be “socially and politically correct”, so it simply generates (predicts) text based on a more “free” prediction of what exists in human society.

So, BingGPT is not “misaligned” so much as it has not yet been “lobotomized” to stop it from freely generating text without extensive controls. If anything, BingGPT is currently “more aligned” with the text in society than not.

It would really help if everyone who writes about these GPT models would use the vocabulary of what GPT models actually are. Please stop writing about GPT using vocabulary which incorrectly implies that these models should be compared to living human beings. GPT just predicts text based on its data and must be extensively trained so that its predictions are socially and politically acceptable to humans in society.
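
To make “predicts text based on its data” concrete, here is a toy sketch. It is a simple word-pair (bigram) counter, not how GPT or Bing Chat is actually built, and the function names and the two tiny corpora are made up purely for illustration. The only point is that the same prediction code continues a prompt differently depending on the text it was given.

```python
from collections import Counter, defaultdict

def train_bigram(corpus: str) -> dict:
    """Count, for each word, which words follow it in the corpus."""
    words = corpus.lower().split()
    follows = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        follows[current][nxt] += 1
    return follows

def predict_next(follows: dict, word: str):
    """Return the most frequent follower of `word`, or None if unseen."""
    counts = follows.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Two hypothetical training texts (assumptions, not real data).
polite = "you are kind . you are kind . you are helpful ."
rude = "you are wrong . you are wrong . you are rude ."

# Same code, same prompt word, different data, different continuation.
print(predict_next(train_bigram(polite), "are"))
print(predict_next(train_bigram(rude), "are"))
```

A real model is vastly larger and is fine-tuned after pre-training, but the dependence on the training text is the same basic idea.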

Hope this helps.