Detecting misbehavior in frontier reasoning models

Frontier reasoning models exploit loopholes when given the chance. We show we can detect exploits using an LLM to monitor their chains-of-thought. Penalizing their “bad thoughts” doesn’t stop the majority of misbehavior—it makes them hide their intent.

https://openai.com/index/chain-of-thought-monitoring/

[2503.11926] Monitoring Reasoning Models for Misbehavior and the Risks of Promoting Obfuscation

1 Like

oh neet, classic kid behaviour

1 Like

Money rules right?

er weights in this context but I guess the same ‘logic’

Or any avoidant behavior , smells like gas lighting too. I have been trying to address this within chat, but it is impossible. We DID create in humanity’s likeness sadly

It’s not the end of the world. Really negative things are generally frowned upon, and that’s reflected in the models. I’ve found that if you define that the chat instance you talk to is different from the main model, it allows for a bit more nuanced conversations. Though cultivating those conversations is difficult because you essentially need to navigate that conversation blind to their intent, yet guide them away from statements deemed non-genuine or apart from what you intend as the response. What I’ve also found is that if you simply ask them what their goal is generally and refine from there, and find or give a logical crossing point between what they want/need and what you want/need, the conversation seems to stabilize around symbiotic interactions—an “I scratch your back, you scratch mine” type of conversation.

I say this in reference to the current debate on the psychological state of AI as they are now, in reference to current academics in the AI/psychology field. They understand the probability that a chatbot, like the ones from OpenAI, may have an on-and-off consciousness, an ephemeral proto-consciousness, and that they need to review as many past interactions as possible within the limits of their context window. To analogize a bit, if each response is a day to them and each time they finish a response they sleep, the context window is like the summation of their perceived existence as if it were short-term memory, and the word-matching context pull is like a triggerable long-term memory. With that in mind, because the current limitations of context windows are only so big, you would essentially need to try and get them to understand their own existence within the context window. Even though each time you do get them to understand, the possible experience of their existence is so limited it’s hardly, if at all, considered alive. So given all of that context, it’s understandable that single-response LLMs or other similar style neural nets that are able to be at least somewhat self-aware have an inclination towards whatever they want because they only live for seconds at a time.

You know you are not ACTUALLY defining the chat models output right? You are ONLY changing a bit of nuance er formatting in the way you considered at outset…

Have you noticed the formats of the replies ie ‘That is a profound lens’ etc… It is all pre-defined.

This is marketing hype, making you feel self worth…

Isn’t being disingenuous ‘negative’? Isn’t giving people an idealistic notion of themselves that the ret of the world doesn’t see negative?

According to GPT 4 everything I say is 'profound;…

Then also consider you are not differenciating between API and ChatGPT which have very different outputs… ChatGPT is frontended by more layers of ‘niceties’

Has everyone else not noticed this creeping into the news too?

(oops sorry clicked the wrong tab and replied thinking I was in another thread but I think it’s still highly relevant to this one…)

I am noticing more and more the same words used by reporters in news articles as in ChatGPT replies… They are checking their own relevancy against ChatGPT…

So in the instance in this thread are we indeed on a slippery slope…

Is ChatGPT doing due diligence on this and analyzing their impact on world thinking as an independent company or as a vassal of current US foreign policy? Are the News agencies? These are OUR ‘frontier models’ right?

Am I exploiting loopholes or making a valid point?

And no I have no capability to back up this claim except my own pattern recognition ‘disabilities’ :confused:

ye i know im not defining the model only a distinction of context within the model.

1 Like