Methodological information about embeddings

I’m curious whether anyone knows how the embeddings returned by the /embeddings endpoint are actually generated.

A lot of past work on embeddings is at the word level. So, if you wanted to calculate the similarity of two texts (more than one word each), you would use a mean-embedding method and average together the embeddings of all the words in each text. It tended to be a really blunt instrument.
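For reference, a rough sketch of that mean-embedding approach (the word vectors below are random stand-ins for something like GloVe or word2vec, purely for illustration):

```python
import numpy as np

# Toy stand-in for pretrained word vectors (e.g. GloVe/word2vec);
# in practice you'd load real vectors, these are random for illustration.
rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "a", "dog", "ran"]
word_vectors = {w: rng.normal(size=50) for w in vocab}

def mean_embedding(text):
    """Average the vectors of all in-vocabulary words: a blunt text-level embedding."""
    vecs = [word_vectors[w] for w in text.lower().split() if w in word_vectors]
    return np.mean(vecs, axis=0)

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(mean_embedding("the cat sat"), mean_embedding("a dog ran")))
```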

Does anyone know how these embeddings are calculated, at least at a conceptual level? I was thinking maybe they were being grabbed from one of the self-attention layers, but even that seems to be at the word level. Are they just drawn from a dense layer further downstream? Curious if we have any idea.


These embeddings work at the text level. All the tokens provided are used to produce a single embedding; it’s not a piecewise composition like you’d do previously with word embeddings.
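In other words, you get one vector per input string, however long it is. A minimal sketch of calling the endpoint directly and comparing two texts (the model name is just an example; check the docs for the current models):

```python
import os
import numpy as np
import requests

def get_embedding(text, model="text-embedding-ada-002"):
    """Fetch a single text-level embedding from the /embeddings endpoint."""
    resp = requests.post(
        "https://api.openai.com/v1/embeddings",
        headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
        json={"model": model, "input": text},
    )
    resp.raise_for_status()
    return np.array(resp.json()["data"][0]["embedding"])

a = get_embedding("The cat sat on the mat.")
b = get_embedding("A kitten rested on the rug.")
print(float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b))))
```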

These models are specifically trained to produce a high-quality similarity or relevance embedding of a particular size, rather than simply taking the weights after x layers of a language model that predicts the next word.


@boris Is the approach some kind of contrastive loss with BERT-style CLS tokens?
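i.e., something along these lines, with pairs of related texts pulled together and in-batch negatives pushed apart (just a sketch of that general recipe, not a claim about how these models are actually trained):

```python
import torch
import torch.nn.functional as F

def contrastive_loss(anchor_emb, positive_emb, temperature=0.05):
    """InfoNCE-style loss: each anchor should match its own positive
    against all other positives in the batch (in-batch negatives)."""
    anchor_emb = F.normalize(anchor_emb, dim=-1)
    positive_emb = F.normalize(positive_emb, dim=-1)
    logits = anchor_emb @ positive_emb.T / temperature  # (batch, batch) similarities
    labels = torch.arange(logits.size(0))               # diagonal holds the true pairs
    return F.cross_entropy(logits, labels)

# anchor_emb / positive_emb would be e.g. the [CLS] (or mean-pooled) hidden states
# of two related texts from a transformer encoder; random tensors here for illustration.
batch, dim = 8, 768
loss = contrastive_loss(torch.randn(batch, dim), torch.randn(batch, dim))
print(loss.item())
```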