What values does the distances_from_embeddings function return?

I would like to know what the distances_from_embeddings function outputs, so that I can decide whether to use an embedding as the context. What range of values should I expect from it?


Thanks for your reply. My confusion is about where the threshold value lies. For example, I have tried two cases: a related pair gave 0.17865794118525224 and an unrelated pair gave 0.2547689763820292. The difference between them is not very significant.

I’m curious. What is the distances_from_embeddings function? :face_with_monocle:

from openai.embeddings_utils import distances_from_embeddings
distances_from_embeddings(q_embeddings, df['embedding'], distance_metric='cosine')
Thanks for your reply. This function returns the distances between a query embedding and a list of embeddings.
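For anyone reading along: distances_from_embeddings came from openai.embeddings_utils, which I believe was removed in openai-python v1.x. A minimal sketch of what it does with distance_metric='cosine' (assuming my reading of the helper is right) is just 1 minus the cosine similarity for each stored vector:

```python
import numpy as np

def cosine_distances(query_embedding, embeddings):
    """Cosine distance (1 - cosine similarity) between a query vector and
    each vector in a list. Rough stand-in for what
    openai.embeddings_utils.distances_from_embeddings did with
    distance_metric='cosine'."""
    q = np.asarray(query_embedding, dtype=float)
    q = q / np.linalg.norm(q)  # normalize the query once
    out = []
    for e in embeddings:
        v = np.asarray(e, dtype=float)
        out.append(1.0 - float(q @ v) / float(np.linalg.norm(v)))
    return out

# Identical vectors -> distance 0; orthogonal vectors -> distance 1.
print(cosine_distances([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]))  # [0.0, 1.0]
```

So smaller numbers mean more related text, which is why the related example above (0.178…) scores lower than the unrelated one (0.254…).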

Ahhh gotcha. I usually work in the similarity space instead, so I’m afraid I cannot help with distances. Good luck! :slight_smile:

Which embedding engine are you using to get those numbers

With ADA embeddings, similarity values tend to be greater than 0.7

I assume you are using the default cosine similarity or the dot product? (With ADA they give the same value, since the embeddings are unit-normalized.)

He’s not using similarity metrics, but distance ones. Probably the cosine distance (1 - cosine similarity) or any other distance metric that you can derive from the similarity. Interestingly enough, the cosine distance is not a formal distance metric, but some others (such as the angular distance) are. For those interested: link


In my experience, the optimal value of the threshold definitely depends on your use case. But if you just want a sound default value that you can refine later: 0.21 :slight_smile:

Yes, I am using the distance method instead of cosine_similarity, but I believe the two are equivalent for ranking, since cosine distance is just 1 - cosine similarity.

Thank you for your suggestion, I will try my best to test whether this value is appropriate.


Thanks for your reply. I am using the ADA embedding (openai.Embedding.create(input=question, model='text-embedding-ada-002')['data'][0]['embedding']), and I believe that whether I use cosine similarity, cosine distance, or the dot product, the effect is the same.
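That matches my understanding: text-embedding-ada-002 vectors come back unit-normalized, so the dot product equals the cosine similarity, and cosine distance is just 1 minus either. A quick sketch with hand-normalized random vectors standing in for real API output:

```python
import numpy as np

# Simulated stand-ins for ada-002 embeddings (1536 dims, unit length).
rng = np.random.default_rng(0)
a = rng.normal(size=1536); a /= np.linalg.norm(a)
b = rng.normal(size=1536); b /= np.linalg.norm(b)

dot = float(a @ b)
cos = float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# For unit vectors, dot product and cosine similarity coincide.
assert abs(dot - cos) < 1e-12
print(1.0 - dot)  # the corresponding cosine distance
```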

Thanks @AgusPG, I didn’t realize you were subtracting from 1

If you want a wider range of numbers, there is a thread in the community using other calculation methods. From memory, Euclidean distance gave the largest range of values.
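For unit vectors the two are directly related: Euclidean distance equals sqrt(2 * cosine_distance), which stretches out the small-distance end of the scale. A quick illustration (my own numbers, not from the linked thread):

```python
import math

# For unit vectors: euclidean = sqrt(2 * (1 - cosine_similarity)).
# The square root expands small cosine distances, which can make
# near/far pairs easier to tell apart at a glance.
for sim in (0.95, 0.79, 0.50):
    cos_dist = 1.0 - sim
    euc_dist = math.sqrt(2.0 * cos_dist)
    print(sim, round(cos_dist, 3), round(euc_dist, 3))
```

Note the ordering of results never changes, only the spacing between the values.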

@ruby_coder and @curt.kennedy talk about it on this thread

We used a similar value of 0.79 for our cutoff on similarity (which matches the suggested distance of 0.21 once you subtract it from one)
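In other words, the two thresholds are the same filter expressed in different spaces. A minimal sketch (names and sample values are illustrative, not from anyone's production code):

```python
# Keep chunks with cosine similarity >= 0.79, or equivalently
# cosine distance <= 1 - 0.79 = 0.21.
THRESHOLD_SIM = 0.79

def keep_by_similarity(sims, threshold=THRESHOLD_SIM):
    return [i for i, s in enumerate(sims) if s >= threshold]

def keep_by_distance(dists, threshold=1.0 - THRESHOLD_SIM):
    return [i for i, d in enumerate(dists) if d <= threshold]

sims = [0.82, 0.75, 0.80]
dists = [1.0 - s for s in sims]
print(keep_by_similarity(sims))  # [0, 2]
print(keep_by_distance(dists))   # [0, 2]
```

Either way, the same chunks survive the cut; pick whichever space your pipeline already works in.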

It’s so awesome that we coincidentally chose the same default value for the threshold! hahaha.
Definitely, 0.79 is the way to go when starting the experimentation then :face_with_monocle:

0.79 and 0.21 currently seem like good values. Thanks @raymonddavey @AgusPG


@alwaysonline521 Which database are you using to store your embeddings? From the discussion, I understand you are creating the question embedding with ADA and matching it against embeddings stored in a vector database.

Pinecone. For local testing, I store the data in a CSV file.
