Embedding model's dimension

Hello everyone.
I want to know if I can set the dimension when using an embedding model, especially text-embedding-ada-002.
I know the output dimension of that model is 1536.
But I need to customize it, because a Pinecone vector database index can be created with a different dimension.
When I use text-embedding-ada-002 to insert data into that index, I get an error.
I think that’s because of the dimension mismatch.
So please let me know how to fix it.


Hey,

Just one question: what is the benefit of having vectors with different dimensions in the database for the same properties of objects?

Thanks for your reply.
I want to test how results change when the dimension varies.

An embedding model works with a particular number of dimensions, where each has semantic meaning. If you replace one dimension with your own value internally, it might not have a terrible effect. If you change the dimensionality, especially on only some vectors, you’ll break tools. The value would of course still need to be a float.

If you keep the difference between values low, such as representing “source version” in the third and fourth digits of a number that is the average of the embedding values across your whole dataset, maybe you can build your own custom in-dimension data retrieval, replacing the last dimension with yours throughout. But it will just sit in there; you won’t have much use for it.

Or you could use two extreme replacement values, such as (max − min tensor value ever) × dimensions, so that dot products are classified into what are almost two different callable embedding vector spaces in the database.


Thanks for your reply.
But I think you are misunderstanding my question.
I want to know if I can customize the dimension of the text-embedding-ada-002 model, for example from 1536 to 1024,
because I have created an index with 1024 dimensions in the Pinecone vector database.

Most likely this is not going to work, because the embeddings are created by a specific model using a specific technique. In other words, we cannot even use embeddings created for 3.5 with GPT-4. Thus you should expect that simply changing the number of dimensions will break a whole lot of functionality.

So, unless you transform all embeddings to match the model structure you are developing for, this will likely remain just an instructive exercise.

Yes, if you simply want to reduce dimensionality stored in general, and aren’t trying to mash things up, that is possible.

Discarding a random 33% of positions (the same positions for every vector) can still give good matching. You likely don’t need to try an encoder.

Or other reduction techniques.

One model’s result can’t be combined with another though, there are completely different semantics to the dimensions.
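A minimal sketch of the “discard a fixed random 33% of positions” idea, assuming numpy and a fixed seed so every stored vector and every query drops the same positions (the names here are just for illustration):

import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed: all vectors and queries must drop the same positions

# pick 1024 of the 1536 positions to keep, once, and reuse it everywhere
keep_positions = np.sort(rng.choice(1536, size=1024, replace=False))

def reduce_vector(embedding):
    # keep only the pre-selected positions of a 1536-d embedding
    return np.asarray(embedding)[keep_positions]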


Could you explain in more detail?
I think your idea is good.

I gave my simple specifications to a chatbot:

def reduce_dimensions(input_list):
    # decimation: drop every third value (indices 2, 5, 8, ...), keeping 2 of every 3
    if len(input_list) % 3 != 0:
        raise ValueError("Input list length must be divisible by 3")

    reduced_list = [input_list[i] for i in range(len(input_list)) if i % 3 != 2]
    return reduced_list

You can use this function by passing your 1536-dimensional list to it, and it will return a new list with 1024 dimensions, keeping two out of every three elements from the original list.

Example usage:

input_list = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, ...]  # Replace this with your actual 1536-dimensional list
reduced_list = reduce_dimensions(input_list)
print(len(reduced_list))  # Output will be 1024

Okay.
But if we do semantic search with such reduced dimensions, does it still work well?

One can only experiment with your particular data and find out how much the quality has degraded. But we can say it is not likely to get better by tossing away information.

You can also try different algorithmic techniques on the dimensions, perhaps discarding half of the last two-thirds instead of 1/3 of all, if we think the first dimensions are more important (and I don’t know that they are).
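For example, a rough sketch of that variant, keeping the first 512 dimensions untouched and every other one of the remaining 1024, so the result is again 1024 values (just an illustration, not a tested recipe):

def reduce_keep_early(vec):
    head = vec[:512]    # first third kept in full
    tail = vec[512::2]  # every other value of the remaining two-thirds -> 512 values
    return head + tail  # 1024 values total (list concatenation)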

Adding two dimensions together to try to preserve them would have an “averaging” effect that simply blurs those two. The linked paper on dimensionality reduction supposes an amplifying effect from its algorithm.


The solution provided by @_j is called “decimation”. And decimation without filtering leads to aliasing (which is usually bad, but try it anyway since it is such an easy technique).

In this spatial vector case, you should look into dimensionality reduction techniques such as PCA, t-SNE, etc. These would apply the correct filtering prior to the decimation (in this case, the dimension reduction).

These techniques work on a batch of already-existing embeddings. So you have to embed a bunch of data and run the algorithm(s) on the batch. This basically picks the “important dimensions”, and in your case you want the top 1024 most important dimensions.
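A minimal sketch of that batch approach with scikit-learn, assuming you already have a list of ada-002 embeddings called embeddings (the names and the 1024 target are placeholders, and PCA needs at least 1024 sample vectors to fit 1024 components):

import numpy as np
from sklearn.decomposition import PCA

X = np.array(embeddings)            # shape (n_samples, 1536), your existing batch
pca = PCA(n_components=1024)        # keep the 1024 most "important" directions
X_reduced = pca.fit_transform(X)    # shape (n_samples, 1024), goes into the 1024-d Pinecone index

print(pca.explained_variance_ratio_.sum())   # fraction of variance the 1024 components retain

# every new query embedding must be projected with the SAME fitted PCA before searching
# query_reduced = pca.transform(np.array([query_embedding]))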

But beware: OpenAI probably decided that 1536 is some critical number of dimensions for the model to perform well. However, you can test this theory by reducing the dimensions and seeing for yourself.
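One hedged way to “see for yourself” is to compare nearest-neighbour rankings before and after the reduction on your own data; a rough sketch, reusing X and X_reduced from the sketch above (top-10 and the overlap metric are arbitrary choices):

import numpy as np

def top_k_neighbors(matrix, query, k=10):
    # indices of the k rows most cosine-similar to the query vector
    m = matrix / np.linalg.norm(matrix, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    return set(np.argsort(-(m @ q))[:k])

full = top_k_neighbors(X, X[0])                      # ranking in the original 1536-d space
reduced = top_k_neighbors(X_reduced, X_reduced[0])   # ranking in the reduced 1024-d space
print(len(full & reduced) / 10)                      # 1.0 means the top-10 results are identical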

Good luck!


Your advice is very good.
I’d like to ask one more question.
Which embedding algorithm is used in the OpenAI embedding model?

The latest model, ada-002, is a trained AI model, just like the rest of them. I don’t know if their exact algorithm details are published, but there is plenty of research and code out there on training your own embedding model.

Thanks for your great response.
I have one more question.
Is it possible to generate context for Q&A from dimension-reduced vectors?

“Generate context for Q&A” doesn’t really make much sense to me. There is some information there, but a particular dimension might, for a particular path of arriving there, mean something like “how much does this seem like government documentation vs. how much does this seem like forum chat, plus how greedy does the writer seem to be”. Since we’ll never turn it back into a machine state, we can only guess, by fuzzing millions of other embeddings and letting machine-learning techniques try to figure out what a dimension’s activations represent.

Yes, I think you can, but I don’t, and here is my “long story long” version of why:

I “reduce dimensions” to really increase the spread of the embedding vectors. Sounds counterintuitive? Yep, but here is the code I wrote, over here in this post:

The example here assumes all your embeddings start out in 3 dimensions. The vectors form a cloud that looks like a pancake tilted about the origin, maybe with one side longer than the other, somewhat like a tilted “elliptical pancake” with some small thickness.

So the PCA I employ in the code quickly finds the major and semi-major axis vectors describing every vector in this pancake, except I drop the thickness dimension.

I find that the two 3-dimensional vectors of this pancake (the major/semi-major axis vectors) describe 95% of the information. In the code I have, you can see how much information each dimension carries, and you can decide how much information you want to retain.

However, I keep the “tilted” 3-d semi-major and major axis vectors. They only span a 2-d plane in the original 3-d space, but I keep the 3 dimensions … and here’s why … DevOps!

So suppose months go by and I re-fit the newer, larger set of vectors … I would expect the basis vectors to shift slightly … and I have to be aware that they may, in fact, change order, or sign, or both. So I would have to compare the new vectors to the old vectors (in the original high-dimensional space) to check that this is still true, and if not, I need to swap positions, and perhaps negate these vectors, to “re-align” to the original frame … and only do this for backwards compatibility! This is especially true if you choose to add or drop dimensions over time!
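A rough sketch of that re-alignment check, assuming old_components and new_components are the arrays returned by two PCA fits (e.g. pca.components_) and that only sign flips, not reordering, have occurred (reordering would need an extra matching step):

import numpy as np

def realign_signs(old_components, new_components):
    # flip any new basis vector that points opposite to its old counterpart
    aligned = new_components.copy()
    for i in range(len(aligned)):
        if np.dot(old_components[i], aligned[i]) < 0:   # negative dot product = flipped direction
            aligned[i] = -aligned[i]
    return aligned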

So if you don’t care about backwards compatibility, you could rotate your two 3-d vectors back into 2 dimensions and use them in the lower-dimensional space to do your searching. The good thing about the lower-dimensional space is that you have a smaller database and your search will be faster. I see no theoretical reason why you can’t do this, as long as you are consciously aware of what percentage of information you are throwing away and don’t need backwards compatibility for future fits.
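As a sketch of that rotation into the lower-dimensional space, assuming major_axis and semi_major_axis are the two (unit-length) kept basis vectors from the fit:

import numpy as np

basis = np.array([major_axis, semi_major_axis])   # shape (2, 3): the two kept axes in the original space

def to_plane(vec3):
    # coordinates of a 3-d embedding in the 2-d plane spanned by the kept axes
    return basis @ np.asarray(vec3)               # shape (2,)

# searching then happens on these 2-d coordinates: smaller vectors, faster comparisons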

You can, of course, re-fit everything and in that case “upgrade” the entire past set of embeddings (providing backwards compatibility) … just more computation and database reads/writes (less efficient), but that could be done too if you would like.