Text embedding: cosine similarity

I am using OpenAI’s embedding API and calculating cosine similarity between vectors. Now, I want to know: is there any standard threshold for cosine similarity? How can I determine the optimal threshold for cosine similarity in vector similarity search?
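For reference, here's a minimal sketch of how cosine similarity between two embedding vectors is usually computed (using NumPy; the vectors would come from the embedding API):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b, in [-1, 1]."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Note that OpenAI's embeddings are returned already normalized to unit length, so for those a plain dot product gives the same number.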

The threshold for vector similarity varies from problem to problem. Depending on how strict you want the matching to be, it typically ranges between 0.7 and 0.9. However, it's best determined by testing on your own problem and seeing what threshold works for you.

I am working on a bot that takes questions and their answers from its owner. When a user enters a question, it uses vector similarity search to find the answer.

It sounds like you'd want a "best of" (top-k) approach in that case, not a threshold: retrieve the most similar stored Q&A pairs, up to whatever amount of context the bot's context length can handle.
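A "best of" retrieval can be sketched like this: rank all stored embeddings by cosine similarity to the query and keep the top k, instead of filtering by a fixed cutoff. (`top_k` and the argument names here are illustrative, not from any library.)

```python
import numpy as np

def top_k(query_vec, doc_vecs, k=3):
    """Return the k (index, similarity) pairs most similar to query_vec.

    doc_vecs: 2-D array-like, one embedding per stored Q&A pair.
    """
    docs = np.asarray(doc_vecs, dtype=float)
    q = np.asarray(query_vec, dtype=float)
    sims = docs @ q / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q))
    order = np.argsort(sims)[::-1][:k]  # indices sorted by descending similarity
    return [(int(i), float(sims[i])) for i in order]
```

You'd then feed the answers for those top-k indices into the bot's prompt, best match first.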

What if you run a science supply company, and the user searches “scientific equipment”?

You can build up an evaluation test set of examples: embed your test dataset, then test retrievals against it using a predefined retrieval prompt set (a pre-tested set where you know quantitatively how similar the items are). That lets you measure any given embedding model's characteristic similarity values. Once you have defined, numerically, what two things that are not alike at all score and what two identical things score, you have some baseline numbers to work from. (P.S. some models give very high values for things that seem totally different; it's just a matter of magnitude.)
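The calibration step described above can be sketched as follows. Here `embed` stands in for whatever embedding call you use (it is a placeholder, not a real API); you feed it pairs you already know are similar or dissimilar and record the scores each model actually produces:

```python
import numpy as np

def cosine(a, b):
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def calibrate(embed, similar_pairs, dissimilar_pairs):
    """Return (mean score on known-similar pairs, mean score on known-dissimilar pairs).

    embed: callable mapping a string to an embedding vector (your API call).
    """
    sim = [cosine(embed(a), embed(b)) for a, b in similar_pairs]
    dis = [cosine(embed(a), embed(b)) for a, b in dissimilar_pairs]
    return sum(sim) / len(sim), sum(dis) / len(dis)
```

The two means give you that model's effective "identical" and "unrelated" baselines; a usable threshold for that model sits somewhere between them, wherever your test retrievals start going wrong.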

I.e. you might have one model that scores “the sun is yellow” and “boxes are used for storage” at 0.77 similarity, not the 0.01 you might expect.