"Semantic alignment" in GPT-3 via fine-tuning

I was perusing the GPT-3 subreddit and came across this nice post.

It occurred to me that I had already been thinking about the problem the author brings up: GPT-3’s tendency to drift because of semantic vagueness. Language is very squishy.

With fine-tuning, I realized that I rely on what I’ve come to think of as “semantic alignment”. What I mean by this is that a fine-tuned model can end up with a very specific understanding of the intent and context surrounding your words. This is because a fine-tuned model can learn a single concept through many examples.

Human brains work much the same way. For instance, a book about negotiation may have one central concept, like “truly care about what the other party wants”. That is easy to say in one sentence, but the author may spend 30,000 words saying the same thing in different ways and with different examples. This serves as “fine-tuning” data for the human brain to arrive at a “semantic alignment” on the meaning of “truly care”. There are hundreds of ways to demonstrate that you care about someone, but this imaginary book gives you enough examples to specifically model what the author means.
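The “many examples of one concept” idea maps directly onto how fine-tuning datasets are usually assembled. Here is a minimal sketch in Python; the negotiation examples are invented for illustration, and the JSONL prompt/completion layout is just one common format for fine-tuning data, not the only one:

```python
import json

# Hypothetical fine-tuning examples: many paraphrases of ONE concept
# ("truly care about what the other party wants"), mirroring how a
# negotiation book restates its thesis with varied examples.
examples = [
    ("How should I open a salary negotiation?",
     "Start by asking what matters most to the other side, and listen."),
    ("My counterpart keeps stonewalling. What now?",
     "Probe for the interests behind their position; show them you want them to win too."),
    ("Is it weak to make the first concession?",
     "Not if it signals genuine interest in what the other party needs."),
]

# Serialize to JSONL: one {"prompt": ..., "completion": ...} object per
# line. Hundreds of such lines, all circling one concept, would form
# the "semantic alignment" training set described above.
jsonl_lines = [
    json.dumps({"prompt": prompt, "completion": completion})
    for prompt, completion in examples
]

for line in jsonl_lines:
    print(line)
```

The key design point is that every record varies the surface wording while holding the underlying concept fixed, which is what lets the model converge on the intended meaning rather than on any single phrasing.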

This is what I am calling “semantic alignment”. I think this concept will be critical for creating benevolent AGI. For instance, in order to avoid Roko’s Basilisk, you will want to ensure that there is no ambiguity in the wording you use. Elon Musk, for instance, said “Maximize freedom of future action”. But even that can be misinterpreted.


Interesting thought and it may be something worth aiming for. On the other hand, I think the nature of language (at least the language we use) cannot help but result in different interpretations. I view two functions of language as compressing and transmitting mental models. As each person hears language, their brain activates idiosyncratic models based on the concepts associated with the words they hear.

I do think the best form of communication between people involves a fine-tuning of their own mental models, as they discover differences between the model they have for what another person means and what that person actually seems to mean. This is a dynamic and personalized alignment process rather than something fine-tuned on examples from many different people. In psychology, the relevant distinction is between an idiographic (individualized) and a nomothetic (normative) approach. Depending on how a model functions, many examples could produce a stereotyped model close to the average, while a few examples produce a likely skewed concept that is unlikely to be close to the average.

I think of verbal concepts as correlated extracted features that create a new imaginary object. If this is true, all concepts are wrong in certain respects. Once you realize this about your own thinking, you can look for how an individual example deviates from your concept, developing a more complex model with a flag for exceptions, or a new concept entirely.
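The few-vs-many-examples point above can be sketched with a toy simulation. Everything here is invented for illustration: a “concept” is modeled as a single number on a 1-D scale, each observed example as that number plus noise, and a learned concept as the mean of the observations:

```python
import random
import statistics

random.seed(0)

# Toy model: the "true" concept is a point on a 1-D scale, and each
# example of it (e.g. one act of caring) is that point plus noise.
TRUE_CONCEPT = 5.0
NOISE_SD = 2.0

def learn_concept(n_examples):
    """Form a concept as the mean of n noisy examples."""
    samples = [TRUE_CONCEPT + random.gauss(0, NOISE_SD)
               for _ in range(n_examples)]
    return statistics.mean(samples)

# Idiographic case: concepts learned from only 3 examples each.
few_shot = [learn_concept(3) for _ in range(200)]
# Nomothetic case: concepts averaged over 100 examples each.
many_shot = [learn_concept(100) for _ in range(200)]

# Concepts built from few examples scatter widely (likely skewed);
# concepts built from many examples cluster near the population average.
print("spread with 3 examples:  ", round(statistics.stdev(few_shot), 3))
print("spread with 100 examples:", round(statistics.stdev(many_shot), 3))
```

The spread of the few-example concepts is much larger, which is the “likely skewed, unlikely close to average” case; the many-example concepts converge on the stereotyped average. Whether averaging toward the norm is what you *want* is exactly the idiographic-vs-nomothetic question.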

Thanks for the thought-provoking post!


Great response. I realized that another, perhaps more accurate, term for this is semantic convergence. That’s closer to what you describe, where people hone their own models and learn to communicate more efficiently. This is most easily observed in the phenomenon of jargon.


Good points. People play games, whether knowingly or not. Convergence could also result in harm, depending on the underlying goals.


I don’t think that’s the right way to look at it. You seem to be interpreting it like a GAN: something intrinsically adversarial. I am talking about something intrinsically cooperative. Humans have structures in place for achieving semantic convergence, with tools like nonviolent communication, active listening, and mirroring. Semantic convergence can only happen through deliberate effort on both sides.

So, in essence, for LLMs, “semantic convergence” would be the literal opposite of a GAN.