Embeddings exist mainly because we need fewer dimensions to make things efficient.

The alternative to embeddings is 1-hot encoding: a massive vector with zeros everywhere, except for a single one in one of the dimensions. So like …

[0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0]

See the single one? That is supposed to represent a single thing, like a single word, or sentence, to the model.
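As a minimal sketch of how such a vector gets built (the helper name `one_hot` is just something I made up for illustration), this reproduces the 19-dimensional vector above, with the one at position 11 counting from zero:

```python
def one_hot(index: int, size: int) -> list[int]:
    """Return a vector of `size` zeros with a single 1 at position `index`."""
    if not 0 <= index < size:
        raise ValueError("index out of range")
    vec = [0] * size
    vec[index] = 1
    return vec

# The vector from the post: 19 dimensions, the 1 sits at index 11 (0-based).
vec = one_hot(11, 19)
print(vec)
```

Note that the vector has to be as long as the number of distinct things you want to represent, which is why it gets massive for a real vocabulary.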

Embeddings collapse this high dimensional vector and “embed” the same information into a lower dimensional space. For example, the above high dimensional vector is represented in this new low dimensional (embedding) space as:

[0.54, 0.35]

The neural network sees this lower dimensional representation, and therefore needs fewer multiplications and less compute (fewer dimensions). So mathematically it’s more efficient. But notice that it has floating point numbers instead of mostly zeros and a single one, so it is considered “dense” instead of “sparse”.
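One way to see the saving, as a rough sketch: multiplying a 1-hot vector by a matrix just picks out one row of that matrix, so an embedding layer can skip the multiplication and do a lookup. The table below is hypothetical (random placeholder values, with row 11 pinned to the numbers from the post); in a real model the values are learned.

```python
import random

VOCAB_SIZE = 19  # length of the 1-hot vector in the post
EMBED_DIM = 2    # length of the embedding vector in the post

random.seed(0)
# Hypothetical embedding table: one row of EMBED_DIM floats per item.
# Real tables are learned during training; these are random placeholders.
table = [[random.random() for _ in range(EMBED_DIM)] for _ in range(VOCAB_SIZE)]
table[11] = [0.54, 0.35]  # pin item 11 to the example values from the post

def one_hot(index, size):
    vec = [0] * size
    vec[index] = 1
    return vec

def matmul_vec(vec, matrix):
    """Multiply a row vector by a matrix the slow, explicit way."""
    cols = len(matrix[0])
    return [sum(vec[i] * matrix[i][j] for i in range(len(vec))) for j in range(cols)]

via_matmul = matmul_vec(one_hot(11, VOCAB_SIZE), table)  # 19 * 2 multiplications
via_lookup = table[11]                                   # zero multiplications

print(via_matmul, via_lookup)
```

Both routes give the same 2-dimensional vector, and everything downstream in the network then works on 2 numbers instead of 19.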

Also, the embedding helps the neural network “learn” because close embeddings correspond to close meaning. If you have tried to train an NN with 1-hot encodings, you will see that it takes much longer for the network to converge (because 1-hot encodings are not continuous: every pair of distinct 1-hot vectors is equally far apart, so they carry no notion of similarity).
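To make the lack of “closeness” in 1-hot vectors concrete, here is a small sketch (the vocabulary size of 5 is an arbitrary assumption) that measures the Euclidean distance between every pair of distinct 1-hot vectors:

```python
import math
from itertools import combinations

def one_hot(index, size):
    vec = [0] * size
    vec[index] = 1
    return vec

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

SIZE = 5  # arbitrary small vocabulary, just for illustration
vectors = [one_hot(i, SIZE) for i in range(SIZE)]

# Collect the (rounded) distance for every pair of distinct vectors.
distances = {round(euclidean(a, b), 6) for a, b in combinations(vectors, 2)}
print(distances)  # only one distinct value: sqrt(2), roughly 1.414
```

Every pair is exactly the same distance apart, so nothing in the encoding itself tells the network which items are related. A dense embedding is free to place related items near each other.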

So: dimensional reduction (faster inference) and faster learning (more “continuity” for the network).

Having said all this, since the embeddings are a dimensional reduction, and essentially a learned parameter, it’s plausible that they could be taken from inside a model such as GPT-4, as some “hidden state”, or internal numerical representation of the input that the neural network acts on to produce its final output.

So … how is this hidden state really trained? Well, umm, gradient descent … but really it depends on the objective of the model, which comes down to its training data.

So in your examples, we can’t say theoretically how the embeddings will relate (it’s not theoretical, it’s driven by training data). We can only say empirically, by comparing the embedding vectors of each thing.

Two vectors should be close if the expected final full model output is similar (continuity).
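A common way to do that empirical comparison is cosine similarity. Here is a minimal sketch; the three 2-dimensional vectors and their labels are invented for illustration and did not come from any real model:

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up embeddings, purely illustrative (not output from a real model).
embeddings = {
    "cat": [0.54, 0.35],
    "kitten": [0.50, 0.40],
    "invoice": [-0.30, 0.80],
}

sim_close = cosine_similarity(embeddings["cat"], embeddings["kitten"])
sim_far = cosine_similarity(embeddings["cat"], embeddings["invoice"])

print(f"cat vs kitten:  {sim_close:.3f}")
print(f"cat vs invoice: {sim_far:.3f}")
```

With real embeddings you would fetch the vectors from the model and run the same comparison; the higher score tells you which pair the model treats as more similar.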

However, ideally, in my perfect world, we could train 1-hot encoded GPT models without ambiguity or closeness being a concern, since we would have infinite compute capacity … BUT approximations are in order, as they always are when the compute required is beyond our current capabilities, which is the case today.

HENCE EMBEDDINGS!