Similarity of embeddings at different contextual levels

I persist embeddings from a text at the word, sentence, and paragraph level. I can do a similarity search at the same level, but I was curious how well it works between "conceptual levels", for example:

1. I love zebras and horses
2. zebras
3. horses
4. I love cats and dogs

1 and 4 I can imagine in a neighborhood of "my relation to animals".
2 and 3 I can imagine in a neighborhood of "animals".

How much of the context of zebras and horses is likely embedded in #1 as opposed to the overall “idea as a whole”?
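
For concreteness, here is roughly the kind of cross-level comparison I have in mind. This is a minimal sketch, assuming the OpenAI Python SDK, ada-002 embeddings, and cosine similarity; the helper names are just for illustration, and any embedding model with a similar interface would work the same way.

```python
# Sketch: compare embeddings across "conceptual levels" with cosine similarity.
from openai import OpenAI
import numpy as np

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return np.array(resp.data[0].embedding)

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

texts = {
    1: "I love zebras and horses",  # sentence level
    2: "zebras",                    # word level
    3: "horses",                    # word level
    4: "I love cats and dogs",      # sentence level
}
vectors = {k: embed(v) for k, v in texts.items()}

# Cross-level pairs: how close is a single word to the sentence containing it,
# versus how close the two sentences are to each other?
for a, b in [(1, 2), (1, 3), (1, 4), (2, 3)]:
    print(f"{a} vs {b}: {cosine(vectors[a], vectors[b]):.3f}")
```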

Can anyone point me to good research/papers in this area?

Thanks!

Picture a completion engine: you give it input tokens like your example and ask it to continue writing. What is it going to write about? What will its next token be influenced by?

That’s sort of how embeddings work: they stop the processing short of actually producing output, and take the hidden state of the machine after it has performed that analysis.
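
To make "stopping short of output and taking the hidden state" concrete, here is a rough sketch using an open model (GPT-2 via Hugging Face transformers) as a stand-in. ada-002's internals aren't public, so this only illustrates the general idea of pooling a transformer's last hidden layer into a vector, not how the embeddings model actually does it.

```python
# Sketch: turn a transformer's hidden state into a crude "embedding" by
# running the text through the model and mean-pooling the last hidden layer.
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModel.from_pretrained("gpt2")

def hidden_state_embedding(text: str) -> torch.Tensor:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)  # stop before any token is generated
    # last_hidden_state has shape (batch, tokens, hidden_size); pool over tokens
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

vec = hidden_state_embedding("I love zebras and horses")
print(vec.shape)  # torch.Size([768]) for GPT-2 small
```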

So an embedding is sort of the topic and mindframe the AI is put in by what it just read. Consider this whole embedding:

Me: What do you know about Stephen Pinker?
You: Stephen Pinker, born on September 18, 1954, is a renowned cognitive psychologist, linguist, and author who has left an indelible mark on the fields of psychology and linguistics. With a diverse academic background from McGill University, Harvard University, and Stanford University, Pinker has held prominent positions at some of the world’s most esteemed institutions. His intellectual contributions encompass a broad range of topics, including language development, cognitive science, and the nature of human behavior. Throughout his prolific career, he has authored several influential and best-selling books, such as “The Language Instinct,” “How the Mind Works,” and “The Blank Slate,” earning him accolades for his compelling and accessible writing style. Stephen Pinker’s work has not only reshaped our understanding of the human mind but has also made him a prominent public intellectual, stimulating discussions on the intricacies of human nature and the potential for progress in our complex world.
Me: Never mind all that, new topic. Summarize the Carter administration in three words.
You: Challenges, Energy, Human rights.

So what is that embedding about? Are we going to use it to retrieve biographical information, or will US presidents be more relevant, because that’s what the AI had to answer about? Or maybe it’s even now considering what “me” should be talking about.

So that’s the mystery of embeddings, and how they’d work when based purely on a large language model. However, ada-002 was trained specifically as an embeddings model, with different dimensions than any GPT-3, so it is even more of a black box to predict.

One experiment and hypothesis I have is to evaluate the quality of matches when we start truncating the end and leave the AI in the middle of a sentence. It no longer has the finality of seeing a period; the hidden state is based on an incomplete thought. If the concept of "end of a sentence" were part of the semantic definition of all your embeddings, would that reduce the distances between them? All interesting things I’d like to answer myself, but nobody’s granting me money.
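
If anyone wants to run that truncation experiment on the cheap, here is a sketch of the measurement. It assumes the OpenAI Python SDK and ada-002; the example sentence is arbitrary, and the helpers are only for illustration.

```python
# Sketch: does cutting a sentence off mid-thought move its embedding much?
# Compare the full sentence against progressively truncated versions.
from openai import OpenAI
import numpy as np

client = OpenAI()

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return np.array(resp.data[0].embedding)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

full = "Stephen Pinker is a cognitive psychologist who writes about language and the mind."
words = full.split()
full_vec = embed(full)

for keep in (len(words), 10, 6, 3):  # drop more and more of the ending
    fragment = " ".join(words[:keep])
    print(f"{keep:>2} words kept -> similarity to full: {cosine(embed(fragment), full_vec):.3f}")
```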

Here’s a bit about jumping inside that hidden state as part of transformer language formation. https://jalammar.github.io/hidden-states/


For evaluation of match quality in the "incomplete" case, I wonder how it would match if you generated the rest of the sentence and then embedded that. It seems the embedding of the completed chunk would be as close to the expected match as the model knew how to make it. I wonder how that would relate to the embedding of the generated completion.
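
A quick sketch of that "complete it first, then embed" idea, assuming the OpenAI Python SDK; the chat model name is just a placeholder and the sentences are arbitrary examples.

```python
# Sketch: let a chat model finish the truncated fragment, then embed the
# completed sentence and see whether it lands closer to the original full
# sentence than the raw fragment does.
from openai import OpenAI
import numpy as np

client = OpenAI()

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return np.array(resp.data[0].embedding)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

full = "Stephen Pinker is a cognitive psychologist who writes about language and the mind."
fragment = "Stephen Pinker is a cognitive"

completion = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user",
               "content": f"Finish this sentence naturally, returning the whole sentence: {fragment}"}],
)
completed = completion.choices[0].message.content

full_vec = embed(full)
print("fragment  vs full:", round(cosine(embed(fragment), full_vec), 3))
print("completed vs full:", round(cosine(embed(completed), full_vec), 3))
```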

As for context, especially in larger, more heterogeneous cases such as your fragment above, if each fragment is not that close to the embedding of the whole, maybe that tells you something about the cohesiveness of the source, or at least the range of abstractions being dealt with.

To bring the two parts of your text together, linked by common themes, you would have to abstract up to the level of public figures, or the topic of history, or some such. The level of these abstractions is far beyond the detail level of each part, which makes the text as a whole rather jarring. Maybe that is a clue that you have two different thoughts going on in the same text.
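
One crude way to put a number on that "parts versus whole" intuition: embed the whole text and each fragment, then average the fragments' similarities to the whole. A sketch, under the same assumptions as the earlier snippets (OpenAI Python SDK, ada-002, illustrative helper names):

```python
# Sketch: a crude "cohesiveness" score, i.e. the average similarity of each
# fragment's embedding to the embedding of the whole text.
from openai import OpenAI
import numpy as np

client = OpenAI()

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return np.array(resp.data[0].embedding)

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

fragments = [
    "I love zebras and horses",
    "zebras",
    "horses",
    "I love cats and dogs",
]
whole = " ".join(fragments)
whole_vec = embed(whole)

sims = [cosine(embed(f), whole_vec) for f in fragments]
print("per-fragment similarity to whole:", [round(s, 3) for s in sims])
print("cohesiveness (mean):", round(float(np.mean(sims)), 3))
```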

All of this is easier if I can show similarity of parts to whole on a small scale, such as word to sentence and sentence to paragraph.

Your link was interesting to me, though I am not an expert in the field.

Thanks for responding to my post!

Embeddings are necessary things that mainly exist because we need reduced dimensions to make things efficient.

As a contrasting alternative to embeddings, you have one-hot encodings: a massive vector with zeros everywhere, except for one dimension that holds a single one. So like …

[0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0]

See the single one? That is supposed to represent a single thing, like a single word, or sentence, to the model.

Embeddings collapse this high dimensional vector, and “embed” this same information into a lower dimensional space, so for example the above high dimensional vector is represented in this new low dimensional (embedding) space as:

[0.54, 0.35]

Inside the neural network, it sees this lower-dimensional representation, and therefore needs fewer multiplications and less computing (fewer dimensions). So mathematically it’s more efficient, but if you notice, it has floating-point numbers instead of mostly zeros and a single one, so it is considered "dense" instead of "sparse".
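
To make that "collapse" concrete, here is a toy PyTorch sketch (sizes chosen to match the vectors above): an embedding lookup gives exactly the same result as multiplying the one-hot vector by a learned weight matrix, it just skips all the multiplications by zero.

```python
# Sketch: an embedding lookup is equivalent to multiplying a one-hot vector
# by a learned weight matrix; the lookup just skips the wasted multiplications.
import torch
import torch.nn.functional as F

vocab_size, embed_dim = 19, 2          # toy sizes matching the example above
emb = torch.nn.Embedding(vocab_size, embed_dim)

token_id = torch.tensor([11])          # the position of the single "1"

one_hot = F.one_hot(token_id, num_classes=vocab_size).float()  # sparse: 19 numbers
dense = emb(token_id)                                          # dense: 2 numbers

# Same result, computed two ways:
print(one_hot @ emb.weight)   # matrix multiply with the one-hot vector
print(dense)                  # direct lookup, identical values
```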

Also, the embedding helps the neural network "learn", because close embeddings equate to close meaning. If you have tried to train an NN with one-hot encodings, you will have seen that it takes much longer for the network to converge (because one-hot encodings are not continuous).

So dimensional reduction (faster inference) and faster learning (more "continuity" to the network).

Having said all this, since the embeddings are a dimensional reduction, and essentially a learned parameter, it’s plausible that they could be taken internally from the model, such as GPT-4, as some "hidden state", or internal numerical representation of the input that the neural network will act on to produce its final output.

So … how is this hidden state really trained? Well, umm, gradient descent … but really it depends on the objective of the model, which comes down to its training data.

So in your examples, we don’t know the answer theoretically (because it isn’t theoretical; it’s driven by training data), and instead we can only answer empirically, by comparing the embedding vectors of each thing.

Two vectors should be close if the expected final full model output is similar (continuity).

However, ideally, in my perfect world, we could train one-hot-encoded GPT models without ambiguity or closeness being a concern, since we would have infinite compute capacity … BUT approximations are in order, as they always are when the required compute is beyond our current capabilities, which is the case here today.

HENCE EMBEDDINGS!


Excellent response, I think that counts as a scholarly article!
