OpenAI embedding with 3072 dimensions

I was expecting better semantic matching with 3072 dimensions, thinking that it would extract more features from the text to match. But I got very poor results. With the legacy model, I was getting a cosine score of 0.81 for similar documents; the score with the new model went below 0.6.

Thanks

It is not safe to compare these figures across models.

Should we be using a different similarity measure with the text-embedding-ada large model? If two similar documents are getting a low score, then the purpose of using semantic matching is defeated.

Yes, the newer models (gen 3) usually require lower threshold numbers to bring back a similar number of matches.

This is not a “large model” afaik; it’s only second gen and 1536 dimensions.

My bad, I meant this model: text-embedding-3-large.

The new models have a larger cosine similarity range than the previous ones.

Realize that, in general, you should be getting similarity scores between -1 and 1, not 0.7 to 1.0.

So the new models have a wider scale, but pay attention to rankings. Do the new rankings make sense, like the top 5 rankings on the new model vs. the old model? I’d pay more attention to this, and then adjust your thresholds.


If anything, the fact that you are seeing much lower scores should inform your opinion that the model may be substantially better than the previous one and is capturing more semantic meaning, since it is seeing more semantic difference between the two example vectors.

But, as has already been said, you can’t really compare the two metrics, at least not directly, because in the end the actual score isn’t as important as the relative ranking of scores for different vectors.

You could have all cosine similarity scores for any vectors you put into a model be above 0.999 or below 0.001 and it wouldn’t really matter as long as the ranking of those scores was appropriate.
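To make that point concrete, here is a minimal sketch (plain NumPy, with illustrative names) of ranking documents by cosine similarity, where only the ordering of the scores matters, not their absolute values:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # OpenAI embeddings come back unit length, so the dot product alone would
    # do; dividing by the norms makes this safe for arbitrary vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k(query_vec: np.ndarray, doc_vecs: list[np.ndarray], k: int = 5) -> list[int]:
    # Return the indices of the k most similar documents. Only this ordering
    # matters when comparing models, not the raw score values.
    scores = [cosine_similarity(query_vec, d) for d in doc_vecs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

# To compare models: embed the same query and documents with each model,
# then compare top_k(...) from one model against top_k(...) from the other.
```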

In very high dimensions any two random vectors would be approximately orthogonal, so for that reason alone we would generally expect embeddings in a 3072-dimensional space to be less similar than embeddings in a 1536-dimensional space.

Incidentally, there are some nice mathematical properties to be had if the embedding space were an actual vector space, which would make it closed under addition and scalar multiplication; the fact that all embedded vectors are normalized to unit length makes this impossible.

I wish OpenAI would share the pre-normalized embeddings, but I don’t see that happening anytime soon.


I haven’t seen any below 0. OpenAI seems to have ensured a dynamic range of 0 to 1. Use max(0, cosine_similarity(x, y)), and nothing is lost.

The dot product scores now plummet quickly away from 1.


Some comparisons of two paragraphs from above (d=1536 of 3-large)

Translate to English (from English) performed by forum AI

1:“It is conceivable that all cosine similarity indic”
match score: 0.9092

Changing one word

2:“You could have all cosine similarity scores for an”
match score: 0.9956

A token > 100k

3:“ிஔ”
match score: 0.0491

And the change to full dimensionality seems to demonstrate more rejection:

== Cosine similarity comparisons ==
0:“You could have all cosine similarity scores for an”
match score: 1.0000
1:“It is conceivable that all cosine similarity indic”
match score: 0.9002
2:“You could have all cosine similarity scores for an”
match score: 0.9952
3:“ிஔ”
match score: 0.0360
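For anyone wanting to reproduce a comparison like the above, a rough sketch with the OpenAI Python client might look like this (the `dimensions` parameter requests a reduced, re-normalized embedding from the 3rd-generation models; the strings are truncated placeholders for the paragraphs being compared):

```python
from openai import OpenAI
import numpy as np

client = OpenAI()

texts = [
    "You could have all cosine similarity scores for an...",   # reference paragraph (truncated here)
    "It is conceivable that all cosine similarity indic...",   # AI paraphrase (truncated here)
]

# Ask text-embedding-3-large for reduced-dimension (1536) embeddings.
resp = client.embeddings.create(
    model="text-embedding-3-large",
    input=texts,
    dimensions=1536,
)
vecs = [np.array(d.embedding) for d in resp.data]

# The returned vectors are unit length, so the dot product is the cosine similarity.
print(f"match score: {np.dot(vecs[0], vecs[1]):.4f}")
```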

I think the other problem with my use case is that I have large chunks. If I reduce the chunks, I may get a better result with 3072. Probably, in the current situation, the chunk is talking about more concepts than the prompt. In the previous model, due to the lower dimensionality, it was working OK since it was probably extracting fewer features from the text.

I’ve found Ada 2 is superior (results in a better set of matches) in my tests for the same chunk size and dimensions, whilst accounting for different thresholds.

What is the threshold score that you have used? I would be curious to know the threshold and your chunk size.

One of the things that I saw with ada 2 is that it considers negative and positive statements semantically similar. I was expecting them to be semantically opposite.

Yes, I am aware of this. But it’s particular to the new model(s), and is atypical of the general math situation.

For example, in general, you should get values ranging from -1 to +1.

But yes, here it looks like 0 to 1 for the latest models. Haven’t fully tested it myself.

Also, in general, any two random uncorrelated things will have a dot product of 0 (orthogonal).

If correlated, a dot product closer to 1.

If anti-correlated, a dot product close to -1.

The angles are:

0 degrees, for dot product +1
90 degrees, for dot product 0
180 degrees, for dot product -1.

This is how the general math works out, and not all embedding engines these days have faithful mathematical properties, probably because of how they are generating the vectors. Plus, it may not be all that important if all you are doing is getting a relative list of top rankings.

Users need to be aware of these differences and peculiarities. Already folks are saying the new model is worse because they don’t get a bunch of correlations above 0.8 anymore. These folks are misleading themselves.

Instead, they should be worried about the relative ranking from the new model. So compare ranks 1, 2, 3, 4, 5 from the new model vs. ranks 1, 2, 3, 4, 5 from the old model, and see which one makes more sense.

Here is why, from ChatGPT. The formula is general; the example is in 3 dimensions, but it extends to arbitrary dimensions, like 3072, as well. Also, the magnitudes coming out of the embedding engine are all 1.0, so there is no need to compute the square root of the sum of the squares; it’s wasted computation here, since you are dividing by 1.0 and this doesn’t change anything. If your vectors don’t always have length 1.0, then you should scale your dot products by dividing by the lengths of each vector, as shown in the formula.
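For reference, the formula being described is the standard cosine similarity, which reduces to a plain dot product when both vectors have length 1.0:

$$\cos\theta = \frac{\mathbf{a}\cdot\mathbf{b}}{\lVert\mathbf{a}\rVert\,\lVert\mathbf{b}\rVert} = \frac{\sum_{i=1}^{n} a_i b_i}{\sqrt{\sum_{i=1}^{n} a_i^2}\,\sqrt{\sum_{i=1}^{n} b_i^2}}$$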

Here is the graph of the inverse cosine. The input ranges over -1 to 1; the output is from 0 to \pi radians, or 0 to 180 degrees.

The -1 to 1 input comes from your cosine similarity, or dot product. Most people do not find the angle in practice, because they are looking at relative rankings, and the inverse cosine is extra computation.
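If you do want the angle anyway, it is just the inverse cosine of the dot product; a quick NumPy sketch (the 0.9092 is one of the example scores above):

```python
import numpy as np

cos_sim = 0.9092                                  # example score from above
angle_deg = np.degrees(np.arccos(np.clip(cos_sim, -1.0, 1.0)))
print(f"{angle_deg:.1f} degrees")                 # ~24.6 degrees
```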

Also, fun fact: the new models are simply truncating and re-scaling the truncated vectors back to length 1.0. So you can do this on your own to carve out arbitrary-dimensional embeddings from the new models. If you store the original 3072-dimension vector, you can change the size of your vectors later without re-embedding. This is useful if you want to reduce dimensions to speed up search, at the cost of some embedding accuracy.
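A minimal sketch of that trick, assuming you have the full 3072-dimensional vector stored (plain NumPy; the random vector here is just a stand-in for a stored embedding):

```python
import numpy as np

def truncate_embedding(vec: np.ndarray, dims: int) -> np.ndarray:
    """Truncate a full-size embedding to `dims` dimensions and re-scale it
    back to unit length (what the API's `dimensions` parameter does)."""
    shortened = np.asarray(vec)[:dims]
    return shortened / np.linalg.norm(shortened)

# Stand-in for a stored 3072-dim text-embedding-3-large vector.
full = np.random.randn(3072)
full /= np.linalg.norm(full)

# Carve out a 256-dimensional embedding without re-embedding the text.
small = truncate_embedding(full, 256)
print(small.shape, np.linalg.norm(small))   # (256,) 1.0
```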


It’s probably just because they use some variation of cross entropy for training, maybe with a scaling factor, no?

I can’t think of a good simple loss function that allows for negatives in this context.

My theory is that the hidden layer where they are drawing the embedding from is biased, which is apparently common in neural networks.

The new models are less biased than ada-002, but still biased enough not to yield a significant number of negative dot products.

Although a 0-to-1 dot-product range does seem fishy. So maybe they have an unbiased hidden layer followed by some other transformation that essentially puts all vectors on one side of a hyperplane, i.e. in one half of the possible vector space.

Something that comes to mind is that a ReLU activation function MIGHT cause this. I’d have to think about it.


Can you tell me the chunk size you used with ada 2? I’m currently testing a tool for retrieving the vectors (or documents) most similar to a query. I’m using text-embedding-3-large, which gives me a 3072-dimension vector space, with a chunk size of 2.5k and some overlap, but the retrieved documents aren’t the best results, or at least not as good as I was hoping for.

I was wondering if I should test different vector sizes or different chunk sizes.


I’m using 11500 chars (I’m not bothering to determine a token limit)

I am using 300 tokens per chunk and then doing parent retrieval.
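For illustration, token-based chunking like that can be sketched with the tiktoken library (300 is the chunk size mentioned above; `chunk_text` is just an illustrative helper):

```python
import tiktoken

# cl100k_base is the tokenizer used by the OpenAI embedding models.
enc = tiktoken.get_encoding("cl100k_base")

def chunk_text(text: str, max_tokens: int = 300) -> list[str]:
    """Split text into consecutive chunks of at most max_tokens tokens."""
    tokens = enc.encode(text)
    return [
        enc.decode(tokens[i : i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]
```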

The strange thing about chunk sizes is that the results can be different for different documents. Some documents have better results with smaller chunks while others have better results with larger chunks. And, it also depends a lot upon the types of questions asked.

This is actually a good strategy to achieve better results, especially if you are using smaller chunk sizes. Advanced RAG 01: Small-to-Big Retrieval | by Sophia Yang, Ph.D. | Towards Data Science
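A rough sketch of that small-to-big idea, assuming you keep a mapping from each small chunk to its larger parent chunk (all names here are illustrative):

```python
import numpy as np

def retrieve_parents(query_vec, child_vectors, child_to_parent, parents, k=5):
    """Search over small child chunks, but return their larger parent chunks.

    child_vectors:   (n_children, d) array of unit-length chunk embeddings
    child_to_parent: child index -> parent id
    parents:         parent id -> full parent text to hand to the LLM
    """
    scores = child_vectors @ query_vec            # unit vectors: dot = cosine
    top_children = np.argsort(scores)[::-1][:k]   # precise matches on small chunks
    parent_ids = []
    for idx in top_children:
        pid = child_to_parent[idx]
        if pid not in parent_ids:                 # de-duplicate parents
            parent_ids.append(pid)
    return [parents[pid] for pid in parent_ids]
```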


From my experience I would say that this is similar to how nuanced information can cause drastic changes in different domains.

When I switched from ada-2 to ada-3-lg with 256 dimensions, I had to spend a little more time perfecting the embedding, but once it was done it performed better than the original with 1536.

@cliff4

Don’t use overlap! If you have complete control over your documents and aren’t trying to automate some embedding process, you shouldn’t need it! Your embeddings in a perfect world would contain atomic information that has no overlap. Process your documents!

Unfortunately, the documents that I have encountered so far have a lot of context overlap. I feel that, as humans, we tend to repeat the same information, maybe to reiterate things. So the documents we write have a lot of context overlap. I think with Gen AI in the mix, we also need to change how we write documents. We probably need to acquire a specialized writing style that LLMs can understand.