No matter how long the input string is, creating an embedding with “text-embedding-ada-002” always returns a vector of 1536 dimensions.

Can someone please explain to me why this is so? I would expect that a text consisting of 2 words would create smaller embedding vectors than a whole paragraph. But that’s not the case. They always have the same length.

(Disclaimer: I am still learning; please forgive me if it’s a stupid question)

Embeddings are generally used to find the similarity between different pieces of text. The notion of ‘similar’ has to be defined with some sort of distance measure.

Normally Euclidean distance or cosine distance is used, and both require the vectors to have the same dimension.

Let’s say the embeddings scaled with the length of the text. It would then be unclear how to compare embeddings of different lengths in a meaningful way.
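To make the same-dimension requirement concrete, here is a minimal sketch of both distance measures in plain Python. The function names and the toy 3D vectors are my own for illustration; real ada-002 vectors would have 1536 entries.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 minus the cosine of the angle between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


# Toy stand-ins for two embeddings (hypothetical values).
short_text_vec = [1.0, 0.0, 0.0]
paragraph_vec = [0.0, 1.0, 0.0]

print(euclidean_distance(short_text_vec, paragraph_vec))
print(cosine_distance(short_text_vec, paragraph_vec))

# A 2D vector against a 3D vector has no defined distance here.
try:
    euclidean_distance([1.0, 0.0], paragraph_vec)
except ValueError as err:
    print("cannot compare:", err)
```

Both formulas pair up the entries of the two vectors one by one, which is why a length mismatch leaves nothing sensible to compute.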

Hopefully the model has learnt to represent text regardless of length in a meaningful way so that it can be used for comparisons.

An embedding is a high dimensional vector, so it points to a single “location” in that higher dimensional space. So the number of dimensions is what controls the size of the vector.

I have the same question, which led me to this post…

I was confused to see that embedding “Hello world!” produced a 1536D vector. I even tried it on the letter “h”; again, a 1536D vector was produced.

I have basic knowledge of NLP techniques, so I hope someone corrects me if I’m wrong, but I assume that the model maps every input into 1536D space for the sake of being able to compare any given input.

To reiterate what james.bower said: the distance formulas require the vectors to be of the same dimension.

Therefore, the model must either reduce or increase dimensions to fulfill the requirement of the vectors being in the same space.
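One common way to get a fixed-size vector out of a variable-length input is to build one vector per token and then average them (mean pooling). I don't know whether text-embedding-ada-002 does exactly this, so treat the following as a toy sketch under that assumption. The names toy_token_vector and embed, the whitespace "tokenizer", and the tiny dimension are all made up for illustration.

```python
import random

DIM = 4  # toy size for readability; the model discussed here uses 1536


def toy_token_vector(token, dim=DIM):
    """Deterministic pseudo-random vector per token (stand-in for a learned one)."""
    rng = random.Random(token)  # seeding with the token makes this repeatable
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def embed(text, dim=DIM):
    """Average the token vectors, so the output length never depends on the input length."""
    tokens = text.split() or [""]  # crude whitespace tokenizer, assumption only
    vectors = [toy_token_vector(t, dim) for t in tokens]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


print(len(embed("h")))
print(len(embed("a whole paragraph made of many more words than one letter")))
```

The number of tokens changes how many vectors go into the average, but not the size of the result, which is set by the number of dimensions alone.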

Why was 1536 chosen? I think it might be a dimension that frequently showed good results, but I’m not entirely sure.

The new embeddings have only 1536 dimensions, one-eighth the size of davinci-001 embeddings, making the new embeddings more cost effective in working with vector databases.

This is from an OpenAI blog entry that specifically mentions the same embedding model and size you note. Please check the blog to learn more.
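As a rough back-of-the-envelope check of the cost point, here is the storage arithmetic. The larger dimension is derived only from the "one-eighth" statement in the quote, and 4 bytes per value (float32) is my assumption; a vector database may store values differently.

```python
BYTES_PER_VALUE = 4  # assumption: each dimension stored as a 32-bit float

ADA_002_DIMS = 1536
DAVINCI_001_DIMS = ADA_002_DIMS * 8  # implied by "one-eighth the size"


def storage_bytes(num_vectors, dims, bytes_per_value=BYTES_PER_VALUE):
    """Raw bytes needed for the vectors alone, ignoring any index overhead."""
    return num_vectors * dims * bytes_per_value


one_million = 1_000_000
print(storage_bytes(one_million, ADA_002_DIMS) / 1e9, "GB")
print(storage_bytes(one_million, DAVINCI_001_DIMS) / 1e9, "GB")
```

Under these assumptions a single 1536D vector takes 6144 bytes, and the same collection at eight times the dimension takes eight times the space.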

This is not a stupid question; it is well worth asking. Yes, that is a large vector, and it took me three hops of searching to find the blog entry. I started with ChatGPT, then used ChatGPT Plus with web browsing to get to a few research papers, then switched to a Google search with both the model name and vector size to finally find the blog entry. I started with ChatGPT Plus with web browsing because I wanted a research paper with authors from OpenAI, but that was not producing what I wanted, so I settled for the blog entry from OpenAI.

Update

For a bit more technical background on high dimensional spaces see this 3Blue1Brown YouTube video