Capturing Meaning other than Similarity (e.g., generalization) in vectors?

I have a question about how to model meaning in vectors. Is it possible to capture information more specific than “these two words/paragraphs are similar”? For example, I’m assuming the words “organization” and “corporation” would be fairly close in the vector space. But what if I want to capture something more than the fact that “organization” and “corporation” have similar meanings? In this case I would like to know that “organization” is more general than “corporation”, or in set-theoretic terms that organization subsumes (is a superset of) corporation. Can that be captured via vector analysis? I think the answer is no, but I wanted to make sure.

1 Like

You can turn phrases into vectors too, though if they get too large, the math gets a bit fuzzy…

What are you trying to do?

I’m trying to capture more than just a relation that says “these two text strings have similar meaning” and instead capture something specific about the meaning, e.g., that one string is more general or specific than another. Another example would be that one string is an instance of another: vectors for “Obama” and “President” would be similar, but I want to recognize that Obama is an element of the set of Presidents (less formally, that Obama is an instance of the concept President). The reason I’m asking is that I’m working on how to use LLMs to help automate the creation of ontologies in the Web Ontology Language (OWL), and these are the kinds of relations that OWL models (it’s based on a subset of FOL called Description Logic).
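To make the target concrete, here is roughly the kind of output I’m after, sketched as RDF triples with rdflib (the URIs and names are just placeholders for illustration):

```python
# The two relations I care about, expressed as triples: subsumption
# ("more general than") and instance-of. Names/URIs are made up.
from rdflib import Graph, Namespace, RDF, RDFS

EX = Namespace("http://example.org/")
g = Graph()
g.add((EX.Corporation, RDFS.subClassOf, EX.Organization))  # corporation ⊑ organization
g.add((EX.Obama, RDF.type, EX.President))                  # Obama is an instance of President
print(g.serialize(format="turtle"))
```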

You could maybe run both versions and get the numbers from them, but no, it can’t really differentiate AFTER it has been turned into a vector…

1 Like

But what you could do is cluster a bunch of chunks together: given a corpus that has been chunked, you can cluster the vectorized representations of those chunks.

If the corpus has enough metadata, then the clustering might become meaningful
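A minimal sketch of that idea, assuming the OpenAI Python SDK and scikit-learn; the chunk texts, model name, and cluster count are placeholders:

```python
from openai import OpenAI
import numpy as np
from sklearn.cluster import KMeans

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

chunks = [
    "Barack Obama served as the 44th President of the United States.",
    "A corporation is a legal entity owned by its shareholders.",
    "Michelle Obama was First Lady from 2009 to 2017.",
    "Nonprofit organizations reinvest surplus revenue into their mission.",
]

# Embed every chunk with the same model so the distances are comparable.
resp = client.embeddings.create(model="text-embedding-3-small", input=chunks)
vectors = np.array([d.embedding for d in resp.data])

# Cluster the vectorized chunks; with enough metadata per chunk you can then
# inspect what the members of each cluster have in common and label it.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(vectors)
for chunk, label in zip(chunks, kmeans.labels_):
    print(label, chunk[:60])
```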

2 Likes

Yes you can! (tm, Obama 2008)

One thing to keep in mind is that the word “President” is not necessarily the superset of any group of presidents you think of.

While some people scoff at the concept of vector arithmetic with LLM based embeddings, I think it’s very useful.

I’m gonna give you some very [word denoting limited quality] data (there’s a bug with the clustering, and it’s Mistral, but anyways)

[Plot: the embedding of “First Lady” in relation to the first ladies]

[Plot: the average of all first ladies in relation to the first ladies]

The concept of “First Lady” or “FLOTUS” will generally be closer to the average of all first ladies than to any specific first lady.

The first ladies will generally be closer to the average of all first ladies than to the descriptor of first ladies. A good descriptor of first ladies will generally be closer to the average of all first ladies than to any specific first lady.

So I would say that the average of a lot of examples actually represents the superconcept, while the embedding of the word for the superconcept doesn’t necessarily reflect the superconcept, if that makes sense.
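If you want to reproduce that comparison yourself, here is a rough sketch along these lines (the model and the list of names are illustrative; my plots above used a different setup):

```python
from openai import OpenAI
import numpy as np

client = OpenAI()

def embed(texts):
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

names = ["Abigail Adams", "Eleanor Roosevelt", "Jacqueline Kennedy",
         "Hillary Clinton", "Michelle Obama"]
members = embed(names)
label = embed(["First Lady of the United States"])[0]
centroid = members.mean(axis=0)  # the "average of all first ladies"

# Compare the label embedding to the centroid and to each individual member.
print("label vs centroid:", cosine(label, centroid))
for name, vec in zip(names, members):
    print(f"label vs {name}:", cosine(label, vec))
```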

4 Likes

Visualization for the win.

Thanks for taking the time to share!

1 Like

This is always fascinating stuff.

Presumably, the concept of First Lady will make up some proportion of the embedding vector of the name of an arbitrary First Lady.

There’s likely a competing dynamic between how famous they were in terms of being a “First Lady” and how famous they may have been outside of that capacity.

An interesting case study might be Hillary Clinton, who quite possibly had more ink spilled about her tenure as First Lady than any of her predecessors, but who is also undeniably famous outside of that capacity in a way no other First Lady has ever been.

Then there’s the fact that the concept of First Lady comprises many other concepts,

  • President
  • Wife
  • Figurehead
  • Etc

Each of those concepts comprises still more others.

Beyond that, there’s the attention mechanism which alters the weights depending upon what else is around it.

So, the concept of First Lady will almost certainly vary in substantial ways.

Imagine there’s a question on a survey that asks, “What are the primary responsibilities of a First Lady?” You’ll likely get wildly different answers if you put a picture of Abigail Adams next to the question than you would if you put a picture of Dolley Madison, or Edith Wilson, or Eleanor Roosevelt, or Jacqueline Kennedy, or Betty Ford, or Rosalynn Carter, or Nancy Reagan, or Hillary Clinton, or Michelle Obama.

Just as the concept of Abigail Adams is partly defined by the concept of First Lady, the concept of First Lady is partly defined by the concept of Abigail Adams; they cannot be disentangled.

That said, taking the mean of the embedding vectors of all of the first ladies is as good as any method I can think of off the top of my head for determining the value of the concept of a First Lady.

One thing that certainly limits and complicates the utility of this sort of concept arithmetic is that the vectors are all normalized. It’s intuitive to imagine the length of an unnormalized embedding vector might correlate to the strength or intensity of the concept it is representing, but it’s almost certainly not quite that simple (though I do wish it were possible to get raw embedding vectors out of the models).
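One bit of signal you can still recover despite the normalization, I think, is the length of the mean of a set of unit vectors. A tiny numpy toy (synthetic vectors, not real embeddings) shows it shrinking as the cluster spreads out, so the centroid’s norm is a crude measure of how tightly packed a concept cluster is:

```python
# Each vector is unit-length, but the *mean* of unit vectors is not. Its norm
# shrinks as the vectors spread apart. Synthetic data only, for illustration.
import numpy as np

rng = np.random.default_rng(0)

def random_unit_vectors(n, dim, spread):
    base = rng.normal(size=dim)
    vecs = base + spread * rng.normal(size=(n, dim))  # jitter around one direction
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

tight = random_unit_vectors(10, 1536, spread=0.1)  # nearly identical directions
loose = random_unit_vectors(10, 1536, spread=2.0)  # widely spread directions

print("norm of tight-cluster centroid:", np.linalg.norm(tight.mean(axis=0)))  # close to 1
print("norm of loose-cluster centroid:", np.linalg.norm(loose.mean(axis=0)))  # well below 1
```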

I also wish they had something like the neuron viewer for the embedding models.

2 Likes

Use Category theory? A combination of transformers and Category theory could be quite useful.

2 Likes

Thanks to everyone for the feedback, especially Diet Regular, great stuff!

1 Like

Do you have experience with this? Do you have any tips, tricks, caveats, etc?

A well-trained embedding model should already be doing this. Organization will be more general than corporation by being closer in latent space to more other concepts. Corporation will be narrower by being near other “feature clusters” that are less interconnected. Remember, these are 1000+ dimensional representations of semantic relationships on a positive and negative scale.
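A rough way to probe that (purely a heuristic, with an arbitrary toy vocabulary, model, and threshold, not an established measure of generality) is to count how many near neighbors each term has:

```python
from openai import OpenAI
import numpy as np

client = OpenAI()

vocab = ["organization", "corporation", "charity", "government", "startup",
         "team", "club", "university", "bank", "committee", "partnership"]

resp = client.embeddings.create(model="text-embedding-3-small", input=vocab)
vecs = np.array([d.embedding for d in resp.data])
vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)
sims = vecs @ vecs.T  # cosine similarity matrix (rows are unit-length)

# Count how many other terms fall within the cutoff for each of the first two words.
threshold = 0.45
for i in range(2):  # "organization" and "corporation"
    neighbors = [vocab[j] for j in range(len(vocab)) if j != i and sims[i, j] > threshold]
    print(vocab[i], "->", len(neighbors), "neighbors:", neighbors)
```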

1 Like

There is a difference, though, in “capture” and “employ”.

Embedding layers, dimensional activations, and their relationships may capture countless unobvious meanings, from “proper name” and “English” to “19th century”, “colonial USA”, or even “uncertain context” or “two words commonly followed by the word ‘is’” in the case of Betsy Ross or Dolley Madison, but it is then down to how you would employ that.

Basically, the only measure we have is distance: comparison between two vectors produced by the same AI model’s training.

1 Like

I was thinking that as well, that it should be possible to discover more than “these two meanings are similar.” But from what I know (and I’m totally new to this stuff), while that information may implicitly be there as you describe, it is one thing to say that and another to actually figure out how to tease out that kind of info, as opposed to just computing semantic distance.

That’s what I thought as well.

What you are referring to is ‘interpretability’, which is translating a model’s inputs and internal calculations back into a human-understandable form. In general, ML systems lack interpretability :sweat_smile: which is why we score their performance instead of “understanding” them.

How to make things like text embeddings more interpretable is, to my knowledge, an area of active research, so you’re wading into some very cutting-edge subject matter at the boundaries of what is known. Pretty exciting stuff! One of my colleagues pointed me to this blog post where the author is using sparse autoencoders (and I don’t know what that means yet either), but basically using raw embedding values to map out the features recognized by the model:

https://thesephist.com/posts/prism/
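For a rough sense of what a sparse autoencoder is, here is a minimal PyTorch sketch of the general idea (dimensions, penalty weight, and the random training data are placeholders, not the linked post’s actual setup): learn an overcomplete, sparsity-penalized code for embedding vectors so that individual code dimensions tend toward interpretable features.

```python
import torch
import torch.nn as nn

class SparseAutoencoder(nn.Module):
    def __init__(self, embed_dim=1536, code_dim=8192):
        super().__init__()
        self.encoder = nn.Linear(embed_dim, code_dim)
        self.decoder = nn.Linear(code_dim, embed_dim)

    def forward(self, x):
        code = torch.relu(self.encoder(x))  # non-negative, hopefully sparse features
        recon = self.decoder(code)
        return recon, code

model = SparseAutoencoder()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
l1_weight = 1e-3  # strength of the sparsity penalty (a guess)

embeddings = torch.randn(256, 1536)  # stand-in for a batch of real embedding vectors
for _ in range(10):
    recon, code = model(embeddings)
    # Reconstruction error plus an L1 penalty that pushes most code units to zero.
    loss = nn.functional.mse_loss(recon, embeddings) + l1_weight * code.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```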

A bit. Because of my background (math, simultaneous interpreter), I see language a bit differently than connectionists do: more as functions among objects. Haskell seems to be a good tool. I believe that LLMs are like the starter engine for something far more powerful. Would be delighted to discuss further.

1 Like

Absolutely. My concern is that embedding spaces may not be wholly convex - that they might be scrunched/creased in some places, not unlike a brain lol, where these transformations might break down in unexpected ways. (haven’t found such anomalies yet though)

I am looking at Voronoi-like tessellations of the space. That would hopefully take care of the convexity.

1 Like

voronoi/dirichlet tesselations on the surface of these ~1000 dimensional unit spheres? :thinking:

basically how IVF works? hmm…
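Which, if I understand it, looks something like this toy sketch (synthetic data; a real index such as FAISS’s IndexIVFFlat does this far more efficiently): k-means centroids carve the unit sphere into Voronoi-like cells, and a query is compared only against vectors in the few nearest cells.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
vectors = rng.normal(size=(5000, 128))
vectors /= np.linalg.norm(vectors, axis=1, keepdims=True)  # project onto the unit sphere

n_cells = 32
kmeans = KMeans(n_clusters=n_cells, n_init=10, random_state=0).fit(vectors)
cell_of = kmeans.labels_  # the Voronoi-like cell each vector falls into

def search(query, n_probe=4, top_k=5):
    query = query / np.linalg.norm(query)
    # Probe only the cells whose centroids are closest to the query.
    dists = np.linalg.norm(kmeans.cluster_centers_ - query, axis=1)
    probe_cells = np.argsort(dists)[:n_probe]
    candidates = np.where(np.isin(cell_of, probe_cells))[0]
    # On unit vectors, ranking by dot product equals ranking by Euclidean distance.
    scores = vectors[candidates] @ query
    return candidates[np.argsort(-scores)[:top_k]]

print(search(rng.normal(size=128)))
```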