Cosine similarity values and embeddings

I’m using the OpenAI embeddings service and calculating cosine similarity in Java, both with the Milvus vector database and with my own implementation.

Normally, cosine values range from -1 to +1 for arbitrary points in an N-dimensional space. But I only get results ranging from 0 to 1. I’m guessing text/image/audio embeddings have certain characteristics that restrict cosine values to the range 0 to 1.
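
For reference, here is a minimal sketch of the kind of hand-rolled calculation described above. The class and method names (`CosineSimilarity`, `cosine`) are made up for illustration and the toy vectors are not real embeddings; it only shows that, for arbitrary vectors, the formula itself can return anything from -1 to +1.

```java
final class CosineSimilarity {

    /**
     * Cosine of the angle between a and b: dot(a, b) / (|a| * |b|).
     * Throws if the lengths differ or either vector is all zeros.
     */
    static double cosine(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("Vectors must have the same length");
        }
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            throw new IllegalArgumentException("Cosine is undefined for a zero vector");
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        double[] v = {1.0, 2.0, 3.0};
        // Same direction: close to +1
        System.out.println(cosine(v, new double[] {2.0, 4.0, 6.0}));
        // Opposite direction: close to -1
        System.out.println(cosine(v, new double[] {-1.0, -2.0, -3.0}));
        // Orthogonal vectors: 0
        System.out.println(cosine(new double[] {1.0, 0.0}, new double[] {0.0, 1.0}));
    }
}
```

If a function like this gives negative values for toy vectors but never for real embeddings, the restricted range comes from the embeddings rather than from the calculation.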

Does this make sense? Any help is appreciated.

You are correct that, in principle, they should vary from -1 to 1. But the models aren’t exactly “geometrically correct”, so in practice you get 0 to 1. That’s still better than ada-002, which only went from 0.7 to 1.

So you have to adjust your thresholds for each model you are working with.
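
One way to handle this, sketched below, is to rescale each raw cosine value against the range you actually observe for that model, so a single cutoff means roughly the same thing everywhere. The class name `ModelThresholds`, the method `rescale`, and the key `"newer-model"` are placeholders I made up, and the ranges are just the rough figures from this thread, so measure them on your own data before relying on them.

```java
import java.util.Map;

final class ModelThresholds {

    // Rough observed {min, max} cosine ranges per model.
    // ASSUMPTION: figures are the ballpark values quoted in this thread;
    // "newer-model" is a placeholder key, not a real model name.
    static final Map<String, double[]> OBSERVED_RANGE = Map.of(
            "ada-002", new double[] {0.7, 1.0},
            "newer-model", new double[] {0.0, 1.0});

    /** Maps a raw cosine value onto 0..1 relative to the model's observed range. */
    static double rescale(String model, double cosine) {
        double[] range = OBSERVED_RANGE.get(model);
        if (range == null) {
            throw new IllegalArgumentException("No observed range for model: " + model);
        }
        double scaled = (cosine - range[0]) / (range[1] - range[0]);
        // Clamp in case a value falls outside the range seen so far.
        return Math.max(0.0, Math.min(1.0, scaled));
    }

    public static void main(String[] args) {
        // A raw 0.85 sits mid-range for ada-002 but is a high score for the other model.
        System.out.println(rescale("ada-002", 0.85));
        System.out.println(rescale("newer-model", 0.85));
    }
}
```

With something like this, a threshold of, say, 0.5 on the rescaled value would correspond to a raw cosine of about 0.85 for ada-002 and 0.5 for the other model.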

Thanks for confirming, Curt!