I have a question about how to model meaning in vectors. Is it possible to capture information more specific than “these two words/paragraphs are similar”? For example, I’m assuming the words “organization” and “corporation” would be fairly close in the vector space. But what if I want to capture something more than “organization” and “corporation” have similar meanings? I.e., in this case I would like to know that “organization” is more general than “corporation” or, in set-theoretic terms, that organization subsumes or is a superset of corporation. Can that be captured via vector analysis? I think the answer is no but I wanted to make sure.
You can turn phrases into vectors too, though if they get too large, the math gets a bit fuzzy…
What are you trying to do?
I’m trying to capture more than just a relation that says “these two text strings have similar meaning” and instead capture something specific about the meaning. E.g., that one string is more general or specific than another. Another example would be that one string is an instance of another. E.g., vectors for “Obama” and “President” would be similar, but I want to recognize that Obama is an element in the set of Presidents (less formally, that Obama is an instance of the concept President). The reason I’m asking is that I’m working on how to use LLMs to help automate the creation of ontologies in the Web Ontology Language (OWL), and these are the kinds of relations that OWL models (it’s a subset of FOL called Description Logic).
You could maybe run both versions and get the numbers from them, but no, it can’t really differentiate AFTER it has been turned into a vector…
But what you could do, given a corpus that has been chunked, is cluster the chunks using their vectorized representations.
If the corpus has enough metadata, then the clustering might become meaningful.
Yes you can! (tm, Obama 2008)
One thing to keep in mind is that the word “President” is not necessarily the superset of any group of presidents you think of.
While some people scoff at the concept of vector arithmetic with LLM based embeddings, I think it’s very useful.
I’m gonna give you some very [word denoting of limited quality] data (there’s a bug with the clustering, and it’s mistral, but anyways)
The embedding of “First Lady”, in relation to the first ladies
The average of all first ladies in relation to the first ladies
The concept of “First Lady” or “FLOTUS” will generally be closer to the average of all first ladies than to any specific first lady.
The first ladies will generally be closer to the average of all first ladies than to the descriptor of first ladies. A good descriptor of first ladies will generally be closer to the average of all first ladies than to any specific first lady.
So I would say that the average of a lot of examples actually represents the superconcept, while the embedding of the word for the superconcept doesn’t necessarily reflect the superconcept, if that makes sense.
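The comparison described above can be sketched in a few lines. This is only an illustration: the 3-dimensional vectors are made up, and the names `instances` and `label` are hypothetical stand-ins for embeddings you would get from whatever model you use. It shows the mechanics of the comparison, not a result about real embeddings.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

# Made-up 3-d stand-ins for the embeddings of individual instances.
instances = {
    "instance_a": [0.9, 0.1, 0.3],
    "instance_b": [0.8, 0.3, 0.1],
    "instance_c": [0.7, 0.2, 0.4],
}
# Made-up stand-in for the embedding of the category word itself.
label = [0.6, 0.5, 0.2]

center = centroid(list(instances.values()))

# For each instance: similarity to the centroid vs. similarity to the label.
for name, vec in instances.items():
    print(name, round(cosine(vec, center), 3), round(cosine(vec, label), 3))
print("label vs centroid:", round(cosine(label, center), 3))
```

With real embeddings, the interesting question is whether the instance-to-centroid similarities are consistently higher than the instance-to-label ones, as the plots above suggest.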
Visualization for the win.
Thanks for taking the time to share!
This is always fascinating stuff.
Presumably, the concept of First Lady will comprise some proportion of the embedding vector of the name of an arbitrary First Lady.
There’s likely a competing dynamic between how famous they were in terms of being a “First Lady” and how famous they may have been outside of that capacity.
An interesting case study might be Hillary Clinton, who quite possibly had more ink spilled about her tenure as First Lady, but who is also undeniably famous outside of that capacity in a way no other First Lady has ever been.
Then there’s the fact that the concept of First Lady comprises many other concepts,
- President
- Wife
- Figurehead
- Etc
Each of those concepts comprises still more others.
Beyond that, there’s the attention mechanism which alters the weights depending upon what else is around it.
So, the concept of First Lady will almost certainly vary in substantial ways.
Imagine there’s a question on a survey that asks, “What are the primary responsibilities of a First Lady?” You’ll likely get wildly different answers if you put a picture of Abigail Adams next to the question than you would if you put a picture of Dolley Madison, or Edith Wilson, or Eleanor Roosevelt, or Jacqueline Kennedy, or Betty Ford, or Rosalynn Carter, or Nancy Reagan, or Hillary Clinton, or Michelle Obama.
Just as the concept of Abigail Adams is partly defined by the concept of First Lady, the concept of First Lady is partly defined by the concept of Abigail Adams; they cannot be disentangled.
That said, taking the mean of the embedding vectors of all of the first ladies is as good as any method I can think of off the top of my head for determining the value of the concept of a First Lady.
One thing that certainly limits and complicates the utility of this sort of concept arithmetic is that the vectors are all normalized. It’s intuitive to imagine the length of an unnormalized embedding vector might correlate to the strength or intensity of the concept it is representing, but it’s almost certainly not quite that simple (though I do wish it were possible to get raw embedding vectors out of the models).
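To make the normalization point concrete, here is a small sketch with made-up 2-d vectors (the `raw` values are hypothetical, not model output). It shows two things: vectors that differ only in length become identical once normalized, and the mean of unit vectors is generally shorter than 1, so it should be renormalized before being compared with other unit vectors by dot product.

```python
import math

def norm(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))

def normalize(v):
    """Scale a vector to unit length."""
    n = norm(v)
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mean(vectors):
    return [sum(col) / len(vectors) for col in zip(*vectors)]

# Made-up "raw" vectors; the second is the first scaled by 2.
raw = [[3.0, 4.0], [6.0, 8.0], [4.0, 3.0]]
unit = [normalize(v) for v in raw]

# Any information carried by length is gone after normalization.
print(unit[0], unit[1])

# The mean of unit vectors usually has length below 1.
center = mean(unit)
print("length of mean:", norm(center))

# After renormalizing, dot product and cosine similarity coincide again.
center_unit = normalize(center)
print("similarity to third vector:", dot(center_unit, unit[2]))
```

How far the length of the mean falls below 1 is a rough indication of how spread out the averaged vectors are, which may be worth looking at when averaging many examples of a concept.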
I also wish they had something like the neuron viewer for the embedding models.
Use Category theory? A combination of transformers and Category theory could be quite useful.
Thanks to everyone for the feedback, especially Diet Regular, great stuff!
Do you have experience with this? Do you have any tips, tricks, caveats, etc?
A well-trained embedding model should already be doing this. Organization will be more general than corporation by being closer in latent space to more other concepts. Corporation will be narrower by being near other “feature clusters” that are less interconnected. Remember, these are 1000+ dimensional representations of semantic relationships on a positive and negative scale.
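One way to probe the “closer to more other concepts” idea is to count how many reference concepts fall within some similarity threshold of a term. The sketch below is a heuristic under stated assumptions: `neighbor_count` is a hypothetical helper, the threshold is arbitrary, and all vectors are made up rather than taken from a model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def neighbor_count(vec, others, threshold):
    """Number of reference vectors whose similarity to vec meets the threshold."""
    return sum(1 for o in others if cosine(vec, o) >= threshold)

# Made-up reference concepts (stand-ins for real embeddings).
reference = {
    "charity": [0.2, 0.9, 0.1],
    "startup": [0.9, 0.2, 0.1],
    "agency":  [0.1, 0.3, 0.9],
    "bank":    [0.9, 0.1, 0.3],
}
broad = [0.6, 0.6, 0.5]    # stand-in for a general term
narrow = [0.95, 0.1, 0.2]  # stand-in for a more specific term

refs = list(reference.values())
print("broad:", neighbor_count(broad, refs, 0.7))
print("narrow:", neighbor_count(narrow, refs, 0.7))
```

Whether real embeddings behave this way for a pair like “organization” and “corporation” would need to be checked empirically, and the result will depend on the choice of reference set and threshold.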
There is a difference, though, in “capture” and “employ”.
Embedding layers, dimensional activations, and their relationships may capture countless unobvious meanings in “Betsy Ross” or “Dolley Madison”, from “proper name” and “English” to “19th century” or “colonial USA”, or even “uncertain context”, “two words”, or “commonly followed by the word ‘is’”, but it is then down to how you would employ that.
Basically, the only meter we have is distance: comparison between two vectors produced by the same AI model’s training.
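A consequence worth spelling out: cosine similarity is symmetric, so a single comparison of two vectors cannot say which term is the broader one. A minimal check, using made-up vectors as stand-ins for the two embeddings:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Made-up stand-ins; real vectors would come from the embedding model.
organization = [0.6, 0.6, 0.5]
corporation = [0.95, 0.1, 0.2]

forward = cosine(organization, corporation)
backward = cosine(corporation, organization)

# Same number in both directions: the measure carries no direction.
print(forward, backward, math.isclose(forward, backward))
```

Any directional relation such as “is more general than” therefore has to come from something extra, for example comparing each term against a third reference such as the average of known instances.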
I was thinking that as well: it should be possible to discover more than “these two meanings are similar”. But from what I know (and I’m totally new to this stuff), while that information may implicitly be there as you describe, it is one thing to say that and another to actually figure out how to tease out that kind of info, as opposed to just computing semantic distance.
That’s what I thought as well.
What you are referring to is ‘interpretability’, which is translating a model’s inputs and internal calculations back into a human-understandable form. In general, ML systems lack interpretability, which is why we score their performance instead of “understanding” them.
How to make things like text embeddings more interpretable is, to my knowledge, an area of active research, so you’re wading into some very cutting-edge subject matter at the boundaries of what is known. Pretty exciting stuff! One of my colleagues pointed me to this blog post where the author is using sparse autoencoders (and I don’t know what that means yet either), but basically using raw embedding values to map out the features recognized by the model:
A bit. Because of my background (Math, Simultaneous Interpreter), I see Language a bit differently than connectionists. More as functions among objects. Haskell seems to be a good tool. I believe that LLMs are like the starter engine for something far more powerful. Would be delighted to discuss further.
Absolutely. My concern is that embedding spaces may not be wholly convex - that they might be scrunched/creased in some places, not unlike a brain lol, where these transformations might break down in unexpected ways. (haven’t found such anomalies yet though)
I am looking at Voronoi-like tessellations of the space. That would hopefully take care of the convexity.
voronoi/dirichlet tessellations on the surface of these ~1000 dimensional unit spheres?
basically how IVF works? hmm…
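For anyone unfamiliar with the comparison: an IVF (inverted file) index partitions the space into Voronoi cells around a set of centroids, stores each vector in the cell of its nearest centroid, and at query time searches only the closest cell or cells. A toy sketch, assuming unit vectors and hand-picked centroids (a real index would learn them, typically with k-means); all names and values here are illustrative:

```python
import math
from collections import defaultdict

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def nearest_cells(vec, centroids, n):
    """Indices of the n centroids most similar to vec (dot product on unit vectors)."""
    order = sorted(range(len(centroids)), key=lambda i: dot(vec, centroids[i]), reverse=True)
    return order[:n]

def build_index(vectors, centroids):
    """Assign each vector to the Voronoi cell of its nearest centroid."""
    cells = defaultdict(list)
    for idx, v in enumerate(vectors):
        cells[nearest_cells(v, centroids, 1)[0]].append(idx)
    return cells

def search(query, vectors, centroids, cells, nprobe=1):
    """Search only the nprobe closest cells; return the best-matching vector index."""
    candidates = [i for c in nearest_cells(query, centroids, nprobe) for i in cells[c]]
    return max(candidates, key=lambda i: dot(query, vectors[i]), default=None)

# Made-up 2-d data, normalized onto the unit circle.
centroids = [normalize(c) for c in ([1.0, 0.0], [0.0, 1.0])]
vectors = [normalize(v) for v in ([0.9, 0.1], [0.8, 0.3], [0.2, 0.9], [0.1, 0.7])]

cells = build_index(vectors, centroids)
query = normalize([0.3, 0.8])
print(dict(cells))
print(search(query, vectors, centroids, cells))
```

Raising `nprobe` trades speed for recall, since a query near a cell boundary can have its true nearest neighbor in an adjacent cell.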