Text Similarity Models - Embedding API Query

Hi - I am using the Text Similarity Models - Embedding API to compute a similarity score between two words. I am referring to the code snippet from this blog post (copied below): Introducing Text and Code Embeddings in the OpenAI API

The result is a “similarity score”, sometimes called cosine similarity, between –1 and 1, where a higher number means more similarity.

The similarity scores are decimals that sit very close to each other, e.g.
Castle – Palace - the score is 0.88
Building – Palace - the score is 0.85
Laptop – Palace - the score is 0.78

Is there a way to get the score as a percentage between 0 and 100% that reflects the similarity? E.g.
Castle – Palace - the score should be around 95%
Building – Palace - the score should be around 50%
Laptop – Palace - the score should be around 10%

I tried looking into soft cosine, Euclidean distance, etc., but I am still unable to find a good solution. Unfortunately, I don’t fully understand the math behind it.
Any help is greatly appreciated. Thanks!

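import openai
import numpy as np

# Request embeddings for both inputs in one call
# (legacy interface from openai-python versions before 1.0).
resp = openai.Embedding.create(
    input=["feline friends go", "meow"],
    engine="text-similarity-davinci-001")

# Pull out the embedding vector for each input.
embedding_a = resp['data'][0]['embedding']
embedding_b = resp['data'][1]['embedding']

# The dot product equals cosine similarity when both vectors have unit length.
similarity_score = np.dot(embedding_a, embedding_b)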
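The snippet above uses a plain dot product, which matches cosine similarity only if both vectors have unit length. As a minimal sketch for checking that assumption, here is cosine similarity with explicit normalization in pure Python. The function name `cosine_similarity` and the toy vectors are illustrative, not part of the blog post or the API.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    # Dividing by both norms removes the effect of vector length.
    return dot / (norm_a * norm_b)


# Toy vectors (not real embeddings): same direction, different lengths.
print(cosine_similarity([3.0, 4.0], [6.0, 8.0]))  # 1.0
```

If this function and the plain dot product agree on real embeddings, the vectors are already normalized and the shorter form is fine.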

Can you write some logic that turns the decimal into a percentage?

You just have to add 1, then divide by 2 (and multiply by 100 to get a percentage).
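As a rough sketch of that suggestion, the function below maps a cosine score from the range -1 to 1 onto 0 to 100. The name `cosine_to_percent` is made up for illustration, and the three scores are the ones quoted in the question.

```python
def cosine_to_percent(score):
    """Linearly map a cosine similarity in [-1, 1] to a percentage in [0, 100]."""
    if not -1.0 <= score <= 1.0:
        raise ValueError("cosine similarity must lie between -1 and 1")
    return (score + 1.0) / 2.0 * 100.0


# Scores reported in the original question.
for pair, score in [("Castle - Palace", 0.88),
                    ("Building - Palace", 0.85),
                    ("Laptop - Palace", 0.78)]:
    print(f"{pair}: {cosine_to_percent(score):.1f}%")
# Castle - Palace: 94.0%
# Building - Palace: 92.5%
# Laptop - Palace: 89.0%
```

Because the mapping is linear, the three results stay just as bunched together as the raw scores.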

Cosine similarity values are nonlinear, so a direct conversion to a percentage doesn’t give a meaningful result. For example, multiplying by 100 turns 0.8 and 0.7 into 80% and 70% (and the add-1-then-divide-by-2 mapping gives 90% and 85%). The real gap in similarity is actually much wider, probably something like 80% and 20%.
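One possible workaround, offered only as a sketch and not as anything proposed in this thread, is min-max rescaling: pick a floor score that you treat as "unrelated" and a ceiling score that you treat as "near identical", then stretch the range between them to 0-100%. The `floor` and `ceiling` values below are hypothetical placeholders; in practice they would have to be calibrated from scores of word pairs you already know to be unrelated or near synonyms, for the specific model in use.

```python
def rescale_to_percent(score, floor, ceiling):
    """Stretch scores between floor and ceiling onto 0-100%, clamping outside it."""
    if ceiling <= floor:
        raise ValueError("ceiling must be greater than floor")
    fraction = (score - floor) / (ceiling - floor)
    # Clamp so scores outside the calibrated range map to 0% or 100%.
    fraction = max(0.0, min(1.0, fraction))
    return fraction * 100.0


# Hypothetical calibration values, chosen only to illustrate the effect.
FLOOR = 0.77    # assumed typical score for unrelated words
CEILING = 0.89  # assumed typical score for near synonyms

for pair, score in [("Castle - Palace", 0.88),
                    ("Building - Palace", 0.85),
                    ("Laptop - Palace", 0.78)]:
    print(f"{pair}: {rescale_to_percent(score, FLOOR, CEILING):.0f}%")
```

This spreads the bunched scores apart, but the percentages mean nothing beyond the calibration set, so they should be read as a relative ranking rather than a true measure of similarity.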

Regards,
Sandeep Khomne


I see, thank you for educating me

It is more expensive and less effective than state-of-the-art alternatives.
Lots of good studies everywhere.

Thanks for the suggestion and the link. Will have a look and see if I uncover something.