Text embedding: cosine similarity

I am using OpenAI’s embedding API and calculating cosine similarity between vectors. Now, I want to know: is there any standard threshold for cosine similarity? How can I determine the optimal threshold for cosine similarity in vector similarity search?
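For reference, here's a minimal sketch of how cosine similarity between two embedding vectors is usually computed (using NumPy; the vectors would come from the embedding API):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b, in [-1, 1]."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
```

Note that OpenAI's embeddings are returned already normalized to unit length, so for those a plain dot product gives the same number.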

The threshold for vector similarity varies from problem to problem. Depending on how strict you want the matching to be, it typically ranges between 0.7 and 0.9. However, it's best determined by testing on your own problem and seeing what threshold works for you.

I am working on a bot that takes questions and their answers from its owner. When a user enters a question, it uses vector similarity search to find the answer.

It sounds like you'd want a "best of" (top-k) approach in that case, not a threshold: retrieve the most similar stored Q&A pairs, up to whatever amount of context the bot's context length can handle.
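A "best of" retrieval can be sketched like this: rank all stored embeddings by cosine similarity to the query and keep the top k, instead of filtering by a fixed cutoff. (`top_k` and the argument names here are illustrative, not from any library.)

```python
import numpy as np

def top_k(query_vec, doc_vecs, k=3):
    """Return the k (index, similarity) pairs most similar to query_vec.

    doc_vecs: 2-D array-like, one embedding per stored Q&A pair.
    """
    docs = np.asarray(doc_vecs, dtype=float)
    q = np.asarray(query_vec, dtype=float)
    sims = docs @ q / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q))
    order = np.argsort(sims)[::-1][:k]  # indices sorted by descending similarity
    return [(int(i), float(sims[i])) for i in order]
```

You'd then feed the answers for those top-k indices into the bot's prompt, best match first.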

What if you run a science supply company, and the user searches “scientific equipment”?

You can build up an evaluation test set of examples: embed your test dataset, then test retrievals against it using a predefined retrieval prompt set (a pre-tested set where you know quantitatively how similar the items are). That lets you measure any given embedding model's characteristic similarity values. Once you have defined, numerically, what two things that are not alike at all score and what two identical things score, you have some baseline numbers to work from. (P.S. some models give very high values for things that seem totally different; it's just a matter of magnitude.)
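The calibration step described above can be sketched as follows. Here `embed` stands in for whatever embedding call you use (it is a placeholder, not a real API); you feed it pairs you already know are similar or dissimilar and record the scores each model actually produces:

```python
import numpy as np

def cosine(a, b):
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def calibrate(embed, similar_pairs, dissimilar_pairs):
    """Return (mean score on known-similar pairs, mean score on known-dissimilar pairs).

    embed: callable mapping a string to an embedding vector (your API call).
    """
    sim = [cosine(embed(a), embed(b)) for a, b in similar_pairs]
    dis = [cosine(embed(a), embed(b)) for a, b in dissimilar_pairs]
    return sum(sim) / len(sim), sum(dis) / len(dis)
```

The two means give you that model's effective "identical" and "unrelated" baselines; a usable threshold for that model sits somewhere between them, wherever your test retrievals start going wrong.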

I.e. you might have one model that scores “the sun is yellow” and “boxes are used for storage” at 0.77 similarity, not the 0.01 you might expect.