Rule of thumb cosine similarity thresholds?

I have used 0.79 as the cosine similarity threshold for text-embedding-ada-002. Any result scoring below that is not considered similar enough to be included in the context.

However, with text-embedding-3-large the same threshold no longer seems effective. Initial tests indicate that a lower threshold is needed.

I’m curious to learn about the rule of thumb for the similarity threshold that people have settled on with text-embedding-3-large.

It’s all relative.

You’ll need to see how your text clusters with the new embeddings.

The general observation has been that similarity scores are much lower across the board with the new embedding models than with text-embedding-ada-002.
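One way to see how your text clusters is to look at the distribution of pairwise cosine similarities over a sample of your own documents. Below is a minimal sketch of that idea, not a definitive method. The function names `pairwise_cosine` and `summarize` are made up for illustration, and the random vectors are only a placeholder: you would pass in your real embeddings as a 2-D array with one row per text.

```python
import numpy as np


def pairwise_cosine(embeddings: np.ndarray) -> np.ndarray:
    """Return the cosine similarity of every distinct pair of rows."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    # Keep only the upper triangle (i < j) so each pair is counted once
    # and self-similarities (always 1.0) are excluded.
    i, j = np.triu_indices(len(embeddings), k=1)
    return sims[i, j]


def summarize(sims: np.ndarray) -> dict:
    """Percentiles that show where most similarity scores fall."""
    return {p: float(np.percentile(sims, p)) for p in (5, 25, 50, 75, 95)}


# Placeholder data (assumption): random vectors stand in for real embeddings.
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(200, 64))
print(summarize(pairwise_cosine(fake_embeddings)))
```

Running this on embeddings from both models for the same texts should show how far the whole distribution has shifted, which is more informative than comparing a single threshold number.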

That’s because there isn’t a definitive answer to your question.


tldr: 0.3

Yes, using rule-of-thumb values to start one’s own exploration is a good way forward for developers who are mostly interested in practical outcomes.

Take a look at what other community members have found to work well for them. From there, you need to tune the threshold to your own use case.
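If you have a handful of query/document pairs that you have judged by hand, tuning can be as simple as sweeping candidate thresholds and keeping the one that scores best. Here is a rough sketch under that assumption. The helper `best_threshold` is hypothetical, the scores and labels are invented for illustration, and F1 is just one possible metric; you may care more about precision or recall depending on your application.

```python
import numpy as np


def best_threshold(sims, labels):
    """Sweep each observed score as a cutoff and return (threshold, f1)."""
    sims = np.asarray(sims, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    best_t, best_f1 = None, -1.0
    for t in np.unique(sims):
        predicted = sims >= t  # pairs that would be kept at this cutoff
        tp = np.sum(predicted & labels)
        fp = np.sum(predicted & ~labels)
        fn = np.sum(~predicted & labels)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_t, best_f1 = float(t), float(f1)
    return best_t, best_f1


# Invented example data: cosine scores and hand-assigned relevance labels.
scores = [0.62, 0.48, 0.41, 0.35, 0.28, 0.22, 0.15]
relevant = [1, 1, 1, 0, 1, 0, 0]
threshold, f1 = best_threshold(scores, relevant)
print(f"threshold={threshold:.2f}, F1={f1:.3f}")
```

A rule-of-thumb value such as the 0.3 mentioned above is a reasonable starting point for the sweep, but the number you end up with depends on your own data.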

