Is it possible to achieve embeddings cosine similarity approaching -1?

The closest I’ve been able to achieve is -0.003850111554290606. Is perfect opposition impossible because all words share context and some semantic content? Is there research on this?
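
For anyone who wants to reproduce this kind of measurement, here's a minimal sketch using the official `openai` Python client. The model and the word pair below are just illustrative assumptions, not necessarily what produced the number above:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def embed(text: str, model: str = "text-embedding-3-large") -> np.ndarray:
    """Fetch one embedding vector for `text`."""
    resp = client.embeddings.create(model=model, input=text)
    return np.array(resp.data[0].embedding)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Standard cosine similarity: a.b / (|a| |b|)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Illustrative word pair only, not the pair behind the number above.
print(cosine_similarity(embed("hot"), embed("cold")))
```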

Welcome to the community!

I don’t know if that’s a meaningful result; I suspect it’s effectively 0. It could just be floating-point noise.

I don’t think it’s possible with text-embedding-3, because it doesn’t look like it’s been trained for that.

If you look at ada-2, it had a minimum cosine similarity of around 0.6.

I suspect the reason it doesn’t work is that it would be difficult to train. What would a good -1 training pair even look like? It’s possible that this is more of a philosophical question than a technology one :thinking:

Seems like one way to have the embeddings cosine similarity approach -1 is to reduce the number of dimensions.

‘fish’ and ‘bicycle’ have an embedding cosine similarity of -0.9776305952877404 in two dimensions.

Where can I find more research on this? For example, fish and bicycle seem very, very dissimilar (almost perfect opposites) in two dimensions but merely dissimilar in 3072 dimensions (text-embedding-3-large).
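
Since the text-embedding-3 models accept a `dimensions` parameter, something like this minimal sketch should reproduce the effect; exact values will vary:

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str, dims: int) -> np.ndarray:
    resp = client.embeddings.create(
        model="text-embedding-3-large",
        input=text,
        dimensions=dims,  # request a truncated, re-normalized embedding
    )
    return np.array(resp.data[0].embedding)

for dims in (2, 256, 3072):
    a, b = embed("fish", dims), embed("bicycle", dims)
    sim = float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    print(f"{dims:>4} dims: cosine similarity = {sim:+.4f}")
```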

I made a post a while ago here:

This will make your embeddings more isotropic (more spread out) and get you closer to -1. But it requires post-processing a large batch of previous embeddings.
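
As an illustration of that kind of post-processing, one common approach is simple mean-centering followed by re-normalization. This is just a sketch of one possible recipe, not necessarily the exact procedure in the linked post:

```python
import numpy as np

def center_and_normalize(embeddings: np.ndarray) -> np.ndarray:
    """embeddings: (n_vectors, n_dims) array of stored embedding vectors."""
    centered = embeddings - embeddings.mean(axis=0)  # remove the shared bias direction
    norms = np.linalg.norm(centered, axis=1, keepdims=True)
    return centered / norms  # back to unit length so cosine similarity is well-behaved
```

After centering, the shared mean component no longer inflates every pairwise dot product, so near-opposite pairs can land much closer to -1.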

But the what/how/why of what’s causing the bias in the embeddings gets into potential biases in the hidden layers that create the embedding vectors.

So ultimately, the most practical solution is to adjust your thresholds for each embedding model you encounter, as they all seem pretty different and do not conform to normal vector-geometry expectations.

I’m inspired by embeddings and thought it might be useful to describe them in a fun, accessible way to spark a broader dialogue. Who else agrees?

Interesting post by “Mishtert T” discussing the algebra of embeddings. It makes the point that queen - woman + man ≈ king.

When I tried this with 30 dimensions, I got a cosine similarity of 0.5412786418597273. Does that seem low?
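
Here’s a minimal sketch of that check, assuming text-embedding-3-large truncated to 30 dimensions via the `dimensions` parameter (the exact model used above isn’t stated, so treat that as an assumption):

```python
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(text: str) -> np.ndarray:
    resp = client.embeddings.create(
        model="text-embedding-3-large", input=text, dimensions=30
    )
    return np.array(resp.data[0].embedding)

# Build the analogy vector and compare it to the target word.
composite = embed("queen") - embed("woman") + embed("man")
king = embed("king")
sim = float(np.dot(composite, king)
            / (np.linalg.norm(composite) * np.linalg.norm(king)))
print(f"cos(queen - woman + man, king) = {sim:.4f}")
```

For context, the queen/king analogy was originally demonstrated on word2vec-style word vectors, so a lower similarity from a sentence-level embedding model isn’t necessarily surprising.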