What values does the distances_from_embeddings function return?

I would like to know what the distances_from_embeddings function outputs, so that I can decide whether to use an embedding as the context. What range of values should I expect from it?


Thanks for your reply. My confusion is about where the threshold value lies. For example, I have tried two cases: a related pair gave 0.17865794118525224 and an unrelated pair gave 0.2547689763820292. The difference between them is not very significant.

I’m curious. What is the distances_from_embeddings function? :face_with_monocle:

from openai.embeddings_utils import distances_from_embeddings
distances_from_embeddings(q_embeddings, df['embedding'], distance_metric='cosine')
Thanks for your reply. This function returns the distances between a query embedding and a list of embeddings.
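For anyone reading along: distances_from_embeddings came from openai.embeddings_utils, which I believe was removed in openai-python v1.x. A minimal sketch of what it does with distance_metric='cosine' (assuming my reading of the helper is right) is just 1 minus the cosine similarity for each stored vector:

```python
import numpy as np

def cosine_distances(query_embedding, embeddings):
    """Cosine distance (1 - cosine similarity) between a query vector and
    each vector in a list. Rough stand-in for what
    openai.embeddings_utils.distances_from_embeddings did with
    distance_metric='cosine'."""
    q = np.asarray(query_embedding, dtype=float)
    q = q / np.linalg.norm(q)  # normalize the query once
    out = []
    for e in embeddings:
        v = np.asarray(e, dtype=float)
        out.append(1.0 - float(q @ v) / float(np.linalg.norm(v)))
    return out

# Identical vectors -> distance 0; orthogonal vectors -> distance 1.
print(cosine_distances([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]]))  # [0.0, 1.0]
```

So smaller numbers mean more related text, which is why the related example above (0.178…) scores lower than the unrelated one (0.254…).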

Ahhh gotcha. I usually work in the similarity space instead, so I’m afraid I cannot help with distances. Good luck! :slight_smile:

Which embedding engine are you using to get those numbers

With ADA embeddings, similarity values tend to be greater than 0.7

I assume you are using the default cosine similarity or the dot product? (With ADA they give the same value, since the embeddings are unit-normalized.)

He’s not using similarity metrics, but distance ones. Probably the cosine distance (1 - cosine similarity) or any other distance metric that you can derive from the similarity. Interestingly enough, the cosine distance is not a formal distance metric, but some others (such as the angular distance) are. For those interested: link


In my experience, the optimal value of the threshold definitely depends on your use case. But if you just want a sound default value that you can refine later: 0.21 :slight_smile:

Yes, I am using the distance method instead of cosine_similarity, but I believe the two are equivalent for ranking, since cosine distance is just 1 - cosine similarity.

Thank you for your suggestion, I will try my best to test whether this value is appropriate.


Thanks for your reply. I am using the ADA embedding (openai.Embedding.create(input=question, model='text-embedding-ada-002')['data'][0]['embedding']), and I believe that whether I use cosine similarity, cosine distance, or the dot product, the effect is the same.
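That matches my understanding: text-embedding-ada-002 vectors come back unit-normalized, so the dot product equals the cosine similarity, and cosine distance is just 1 minus either. A quick sketch with hand-normalized random vectors standing in for real API output:

```python
import numpy as np

# Simulated stand-ins for ada-002 embeddings (1536 dims, unit length).
rng = np.random.default_rng(0)
a = rng.normal(size=1536); a /= np.linalg.norm(a)
b = rng.normal(size=1536); b /= np.linalg.norm(b)

dot = float(a @ b)
cos = float(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# For unit vectors, dot product and cosine similarity coincide.
assert abs(dot - cos) < 1e-12
print(1.0 - dot)  # the corresponding cosine distance
```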

Thanks @AgusPG, I didn’t realize you were subtracting from 1

If you want a wider range of numbers, there is a thread in the community using other calculation methods. From memory, Euclidean distance gave the largest range of values.
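For unit vectors the two are directly related: Euclidean distance equals sqrt(2 * cosine_distance), which stretches out the small-distance end of the scale. A quick illustration (my own numbers, not from the linked thread):

```python
import math

# For unit vectors: euclidean = sqrt(2 * (1 - cosine_similarity)).
# The square root expands small cosine distances, which can make
# near/far pairs easier to tell apart at a glance.
for sim in (0.95, 0.79, 0.50):
    cos_dist = 1.0 - sim
    euc_dist = math.sqrt(2.0 * cos_dist)
    print(sim, round(cos_dist, 3), round(euc_dist, 3))
```

Note the ordering of results never changes, only the spacing between the values.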

@ruby_coder and @curt.kennedy talk about it on this thread

We used a similar value of 0.79 for our cutoff on similarity (which matches the suggested distance of 0.21 once you subtract it from one)
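In other words, the two thresholds are the same filter expressed in different spaces. A minimal sketch (names and sample values are illustrative, not from anyone's production code):

```python
# Keep chunks with cosine similarity >= 0.79, or equivalently
# cosine distance <= 1 - 0.79 = 0.21.
THRESHOLD_SIM = 0.79

def keep_by_similarity(sims, threshold=THRESHOLD_SIM):
    return [i for i, s in enumerate(sims) if s >= threshold]

def keep_by_distance(dists, threshold=1.0 - THRESHOLD_SIM):
    return [i for i, d in enumerate(dists) if d <= threshold]

sims = [0.82, 0.75, 0.80]
dists = [1.0 - s for s in sims]
print(keep_by_similarity(sims))  # [0, 2]
print(keep_by_distance(dists))   # [0, 2]
```

Either way, the same chunks survive the cut; pick whichever space your pipeline already works in.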

It’s so awesome that we coincidentally chose the same default value for the threshold! hahaha.
Definitely, 0.79 is the way to go when starting the experimentation then :face_with_monocle:

0.79 and 0.21 currently seem like good values. Thanks @raymonddavey @AgusPG


@alwaysonline521 Which database are you using to store your embeddings? From the discussion, I understand you are creating the question embedding with ADA and matching it against embeddings stored in a vector database.

Pinecone. For local testing, I store the data in a CSV file.
