For simplicity, you could take the sum of all the vectors, then normalize it back to the unit hypersphere. This would essentially represent a new embedding vector that is the average embedding.
Let S = \sum_i v_i for the vectors v_i. Then V = \frac{S}{\|S\|}.
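As a minimal numpy sketch of that step (the mean_unit_vector name is just illustrative, and the embeddings are assumed to be equal-length numpy arrays):

import numpy as np

def mean_unit_vector(embeddings):
    # S = sum of the vectors v_i
    S = np.sum(np.asarray(embeddings), axis=0)
    # V = S / ||S||, projecting the sum back onto the unit hypersphere
    return S / np.linalg.norm(S)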
To figure out what this new vector represents, correlate it with all your previous labeled embedding vectors. Then take the average or top label as your answer.
If your categories are linearly related, like (0) Awesome, (1) Good, (2) OK, (3) Not Good (4) Terrible, you can average the category integers (or numbers) and then round to the nearest integer as your label for this new average vector.
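A rough sketch of that labeling step, assuming unit-norm embeddings stored as rows of a numpy array and integer category labels (nearest_category, labeled_embeddings, labels, and top_k are placeholder names, not anything standard):

import numpy as np

def nearest_category(mean_vector, labeled_embeddings, labels, top_k=10):
    # Dot the mean vector against every labeled embedding (cosine similarity on unit vectors)
    scores = np.asarray(labeled_embeddings) @ np.asarray(mean_vector)
    # Average the integer labels of the top-k most correlated examples, then round
    top_idx = np.argsort(scores)[-top_k:]
    avg = float(np.mean(np.asarray(labels)[top_idx]))
    return int(round(avg))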
What I do is take it a step further: I compute a correlation-weighted average, and also the correlation-weighted standard deviation.
So I get two numbers, one representing the average label, and the other representing the uncertainty of that label.
Here is the code I use for this:
import numpy as np

def correlationWeightedAverageSigma(correlation_scores, label_values):
    # Convert lists to numpy arrays
    correlation_scores = np.array(correlation_scores)
    label_values = np.array(label_values)
    # Compute the exponential of each correlation score to use as weights
    exp_scores = np.exp(20 * correlation_scores)  # factor of 20 added to strongly favor the largest correlations over the others
    # Normalize these weights to sum to 1
    weights = exp_scores / np.sum(exp_scores)
    # Calculate the weighted average (mean) of the label values
    weighted_average = np.dot(weights, label_values)
    # Calculate the weighted squared differences from the weighted mean
    weighted_squared_diffs = weights * (label_values - weighted_average) ** 2
    # Sum the weighted squared differences
    sum_weighted_squared_diffs = np.sum(weighted_squared_diffs)
    # Calculate the weighted standard deviation
    weighted_std_dev = np.sqrt(sum_weighted_squared_diffs)
    # Return both the weighted average and the weighted standard deviation
    return weighted_average, weighted_std_dev
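For context, a hypothetical call might look like this (labeled_embeddings, labels, and mean_vector are placeholder names for your own data, with everything unit-normalized so a plain dot product is the correlation):

# Correlate the new mean vector against each labeled embedding
correlation_scores = np.asarray(labeled_embeddings) @ mean_vector
label_values = labels  # e.g. integers 0..4 for Awesome..Terrible

avg_label, label_sigma = correlationWeightedAverageSigma(correlation_scores, label_values)
print(f"label ~ {avg_label:.2f} +/- {label_sigma:.2f}")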
Note that I am exponentially weighting my correlation values, with an extra factor of 20. In my case I have so much data that many of the correlations are very close together, and this e^{20x} weighting accentuates the differences between them.
You could generalize this to e^{kx} weighting for k ≥ 0 and tune k to fit your situation. Note that k = 0 reduces to the standard unweighted mean.
Also, because the exponential maps negative correlation scores to positive weights, I do not have to worry about averaging negative correlations either.
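If you want k as an explicit knob, here is a sketch of the same idea (correlationWeightedAverageSigmaK is just an illustrative name, not my production code):

import numpy as np

def correlationWeightedAverageSigmaK(correlation_scores, label_values, k=20.0):
    correlation_scores = np.array(correlation_scores)
    label_values = np.array(label_values)
    # Softmax-style weights e^{k x}, shifted by the max score for numerical stability
    # (the shift cancels out after normalization). k = 0 gives uniform weights, i.e. the
    # standard mean; larger k leans harder on the top correlations.
    exp_scores = np.exp(k * (correlation_scores - np.max(correlation_scores)))
    weights = exp_scores / np.sum(exp_scores)
    weighted_average = np.dot(weights, label_values)
    weighted_std_dev = np.sqrt(np.sum(weights * (label_values - weighted_average) ** 2))
    return weighted_average, weighted_std_dev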
This approach works well for me for finding the top correlations and the average label over those correlations. But you could also use it to find the average label, or sentiment, over any given time period, after you dot each of your labeled embeddings with your central mean vector.
This is essentially a formal look at what @anon22939549 posted above, or at least my working version of it. It actually works, and in my case I use it to build embedding-based classifiers without requiring a fine-tuned model.