What is a proper way to combine multiple cosine similarities?

Greetings,

My system collects specific news from across the globe and calculates cosine similarities against specific sentiments. It is working very well.

At this point, I need to calculate daily, weekly and monthly sentiments. That means I need to merge all cosine similarities during the day and provide daily (or weekly, monthly) trends.

What is the proper mathematical way to combine multiple cosine similarities in a time period to have aggregated weekly similarity?

4 Likes

That’s not really how cosine similarity works—that is to say there is no meaningful way to do this.

1 Like

How about a daily average:

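Roughly this, in my notation, where s_{d,i} is the i-th cosine similarity recorded on day d and N_d is how many there were that day:

\bar{s}_d = \frac{1}{N_d} \sum_{i=1}^{N_d} s_{d,i}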

It is showing meaningful results when some of the events spike in the news.

But I am hoping for a better mathematical way of doing this.

No.

Why wouldn’t you just work with the calculated sentiments? This seems like a simple aggregation task.

Sorry, not following. What are the calculated sentiments?

I have all the similarities calculated for news collected every second. The cosine similarities form a time series.

I just need to aggregate similarities in the time series from seconds to hours/day/week.

I can visually see the average is working well. I just need to know if what I am doing aligns with best practices.

You’re correlating the results to sentiments, right?

So why not just use that? Am I missing something here? What you’re asking is very strange.

So you are saying the score is plotted in a time series. I’m guessing you’re assuming that the score is the confidence/weight of the sentiment. So a score of .9 in “Happy” would be a very happy article?

So what do you want? A graph for every sentiment you are tracking? Or are you combining them?

Greetings!

One thing you may want to consider depends on what embedding model you’re using: Ada, or text-embedding-3?

In theory, cosine similarity is basically the normalized dot product of two vectors.

Now the question is, do you want to get the mean cosine of the angles, or do you want the cosine of the mean angle?
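To make the question concrete, here is a minimal NumPy sketch of the two quantities side by side (the numbers are made up):

import numpy as np

# sims: one day's cosine similarities of articles against a single reference vector
sims = np.array([0.62, 0.71, 0.58, 0.80])

# option 1: mean of the cosines (the plain linear average of the similarities)
mean_of_cosines = sims.mean()

# option 2: cosine of the mean angle (convert to angles, average, convert back)
thetas = np.arccos(np.clip(sims, -1.0, 1.0))
cosine_of_mean_angle = np.cos(thetas.mean())

print(mean_of_cosines, cosine_of_mean_angle)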

Wikipedia has a good article with the geometric interpretation here:

Dot product - Wikipedia

The Ada vs. text-embedding-3 issue is that Ada never gives you orthogonal vectors, so a linear mean (like you described) may be good enough.

With text-embedding-3 you can check whether your sentiment references are actually complementary, or how confounded they are.


Overall it depends a bit on what kind of question you’re asking and want answered :thinking:

I’d personally think a histogram would be better so you can observe the distribution, which may be important, but overall it depends on what you need, of course :slight_smile:

3 Likes

Thank you for the help. Appreciate it.

I use quantified confluences (or features, in ANN/CNN terms) like the following to project cosine similarity onto a number from 1 to 100.

<< CAN’T POST A LINK TO MEDIA BIAS DISTRIBUTION CHART HERE :frowning: >>

My system has collected two years of news articles and transformed them into a vector database.

When I track topics like “vaccine” or “Inflation”, the time series visualization is working well. Please note I am tracking ~200 topics.

I ended up with a massive number of items, so my visualization engine struggles.

I don’t need up-to-the-second numbers. I need daily numbers, so I can measure correlations over the past two years.

The average method I explained above is working. I just wanted to come out of my shell and see how other people are solving the same problem.

At the same time, I am confused by the firmly dismissive answers above, because cosine similarities are like distances. A shipping company can measure the average daily distance each truck travels and add multipliers like truck model, etc. Why can’t I use a similar approach when I measure distances between news articles and predefined topics?

BTW, I am blown away when I work with embedding vectors and similarities in practice.
FYI, this discussion was extremely helpful.

1 Like

I’m thinking that the derivative of a smoothed theta could be really interesting here - it should spike and give you a signal when things are in the middle of changing :thinking:

1 Like

Would you refer me to a learning resource that explains “derivative of a smoothed theta”?

Hi @ptrader

If I understand correctly, you want to show how the trend of a sentiment (“Happiness”, for example) about a specific topic varies with time, based on the news about it?

1 Like

Embeddings are gigantic vectors: 1,500, 3,000, 150,000 dimensions, or whatever.

For a lot of models, you can think of them as populating the surface of a hypersphere (just a high dimensional sphere, just the surface). You’re comparing these surface positions.

You have your reference vectors: the sentiments you’re comparing against.

All your news articles are wiggling and jiggling in that high-dimensional space, but when you compare one of them to a reference vector, you smoosh everything down to a two-dimensional plane (more or less). You have your origin, the target coordinate, and the reference coordinate. Those three points form an angle, and that angle has a theta. The cosine of this theta is your cosine similarity.

If you have a complementary reference that stands at 90° to your other reference (cosine similarity zero), you are also spanning a plane. Maybe it’s the happy/sad plane, for example. You can project all your news articles onto that plane and get a happy angle and a sad angle, for example (oversimplified, but you get the gist).
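A minimal sketch of that projection, assuming you already have embeddings for the article (v) and the two references (the helper name is mine):

import numpy as np

def plane_angle(v, r_happy, r_sad):
    # orthonormalize the second reference against the first (Gram-Schmidt),
    # so the pair spans a clean 2D plane even if they aren't exactly at 90°
    e1 = r_happy / np.linalg.norm(r_happy)
    e2 = r_sad - np.dot(r_sad, e1) * e1
    e2 = e2 / np.linalg.norm(e2)

    # coordinates of the article embedding projected onto the happy/sad plane
    x, y = np.dot(v, e1), np.dot(v, e2)

    # angle within that plane: 0 is toward "happy", pi/2 is toward "sad"
    return np.arctan2(y, x)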

The derivative of the smoothed cosine/sine of the progression of this theta over time is what I meant.
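In code, that would be something like this rough sketch (made-up daily numbers, a 3-day moving average as the smoother):

import numpy as np

# daily_sims: the daily-aggregated cosine similarity against one reference
daily_sims = np.array([0.55, 0.56, 0.58, 0.64, 0.71, 0.72, 0.73])

# convert similarities to angles
thetas = np.arccos(np.clip(daily_sims, -1.0, 1.0))

# smooth with a short moving average to damp day-to-day noise
window = 3
smoothed = np.convolve(thetas, np.ones(window) / window, mode="valid")

# discrete derivative: large magnitudes flag the days where the trend changes fastest
dtheta = np.diff(smoothed)
print(dtheta)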


:thinking:

Come to think of it, it’s almost exactly what you’ve described here, with some extra steps (minus the derivatives, but the changes are obvious when you look at the trend, as you mentioned).

To be honest, I wouldn’t be surprised if the difference between what you’re doing, and a geometrically/mathematically well-founded approach was absolutely negligible.

2 Likes

Yes. At the same time, I would like to avoid the term “sentiment” since it might distract from the core point.

I simply want average distances, similar to when we average our daily commute mileage.

Thanks so much for help.

I need a bit of time to go deeper into what you explained.

I will provide the outcome here.

I expect the correlation matrix will be entertaining :wink:

2 Likes

Here’s a dead simple experiment you can do right now assuming you have at least one thumb and two index fingers.

Let your thumb be your reference vector. Your two index fingers will represent different embeddings.

Arrange them however you like, such that both index finger vectors are emanating from the base of your thumb.

The angles between each index finger and the thumb represent the cosine similarities between the embedded vectors and your reference point.

Now, note you can rotate your free index finger around the axis of the thumb, this doesn’t change the cosine similarity between the free index finger and thumb, but it can have a tremendous effect on the cosine similarity between the two index fingers.
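Here is the same experiment in NumPy, if you want to see the numbers (the angles are arbitrary):

import numpy as np

# the reference ("thumb") points at the north pole
ref = np.array([0.0, 0.0, 1.0])

def finger(theta_deg, phi_deg):
    # unit vector at polar angle theta from the reference, azimuth phi
    t, p = np.radians(theta_deg), np.radians(phi_deg)
    return np.array([np.sin(t) * np.cos(p), np.sin(t) * np.sin(p), np.cos(t)])

a = finger(40, 0)    # first "index finger"
b = finger(40, 150)  # second finger: same angle to the thumb, rotated around it

print(np.dot(a, ref), np.dot(b, ref))  # both ~0.77: same similarity to the reference
print(np.dot(a, b))                    # ~0.23: very different similarity to each other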

Also note, cosine similarity is not a perfect measure of semantic similarity and certainly not a perfect proxy for sentiment agreement.

So, staying in three dimensions for a bit, we’ve established that cosine similarity doesn’t care where in the space you’re pointing, just the angular distance from the frame of reference. Let’s envision the space now as a ball, and our reference frame (thumb) points to the North Pole of our ball.

An embedding with an arbitrary cosine similarity is no different than any other with that same cosine similarity (with respect to this reference frame). So we can abstract this one vector a bit to be the ring on the surface of this ball formed by rotating the vector around our z-axis/reference vector/thumb.

Continuing on, with respect to this one frame of reference, all of the cosine similarities for all of the text you embed will all form their own rings on this reference ball.

It’s not clear or obvious what a simple (or even weighted) mean of the angular distances of these rings from the pole should represent—especially before we’ve rigorously established that the cosine similarity is a good proxy for sentiment alignment.

I agree that intuitively it at least seems like a good starting point for experimentation. But, I’d be hard pressed to try to justify it mathematically without putting quite a bit more thought into it first.

Also note, so far we’ve been thinking and discussing in three dimensions. When we move to 1,500 or 3,000 dimensions things get a whole lot trickier—this is where a lot of my hesitation regarding interpretability of the average cosine similarity comes from.

It’s just really not clear what it really means, in terms of sentiment, for a cosine similarity of 0.72 between an embedding of a bit of text and a reference embedding today, and a cosine similarity of 0.75 between another bit of text and that same reference vector tomorrow, especially because we do not have the capability of ensuring that the measured cosine similarity captures only the sentiment represented by the reference vector.

For instance, perhaps 80% of the cosine similarity represents sentiment alignment with the reference and the other 20% is, say, the structural format of the text.

Any perceived trend in sentiment could be masked by the “noise” of the structure.

There may be ways to possibly mitigate this—multiple different references for the same sentiment springs to mind—but without a clear idea of what you need to control for this is a difficult task at best.
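For illustration, the multiple-references idea might be as simple as this sketch (the helper name and array shapes are my own assumptions):

import numpy as np

def multi_reference_similarity(article_vec, reference_vecs):
    # reference_vecs: several differently-phrased, unit-normalized embeddings of
    # the *same* sentiment, one per row; article_vec: unit-normalized article
    sims = reference_vecs @ article_vec
    # averaging over phrasings dilutes whatever any single reference happens to
    # pick up about structure or format rather than the sentiment itself
    return sims.mean()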

Huge amounts of data might smooth out some of the noise, at least enough to be able to suggest a trend of some sort.

Having written all that, you should feel free to just go ahead and do what you’re doing anyway—but continue to experiment and look for any new literature which addresses these issues because,

If the results you’re getting doing process X are better than the results you get not doing process X, it would be foolish not to do X. Even if you can’t explain rigorously how X works, just keep an eye on it, be skeptical, and be willing to change course if X stops working.

3 Likes

For simplicity, you could take the sum of all the vectors, then normalize it back to the unit hypersphere. This would essentially represent a new embedding vector that is the average embedding.

Let S = \sum v_i, for vectors v_i. Then V = \frac{S}{||S||}

To figure out what this new vector represents, correlate it with all your previous labeled embedding vectors. Then take the average or top label as your answer.

If your categories are linearly related, like (0) Awesome, (1) Good, (2) OK, (3) Not Good (4) Terrible, you can average the category integers (or numbers) and then round to the nearest integer as your label for this new average vector.
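Putting those steps together, a minimal sketch might look like this (function and variable names are mine; the weighted version with an uncertainty estimate follows below):

import numpy as np

def average_label(period_vectors, labeled_vecs, labels):
    # period_vectors: the period's article embeddings (n x dims), unit-normalized
    # labeled_vecs: reference embeddings for the categories (k x dims), unit-normalized
    # labels: integer category for each reference, e.g. 0=Awesome ... 4=Terrible
    labels = np.asarray(labels)

    # sum the period's vectors and renormalize back to the unit hypersphere
    S = period_vectors.sum(axis=0)
    V = S / np.linalg.norm(S)

    # correlate the mean vector with every labeled reference
    sims = labeled_vecs @ V

    # top label, plus the rounded similarity-weighted average of the category integers
    top = labels[np.argmax(sims)]
    w = np.clip(sims, 0.0, None)  # ignore negative correlations
    avg = int(round(np.dot(w, labels) / w.sum()))
    return top, avg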

What I do is take it a step further, and I take correlated weighted averages, and also find the correlated weighted standard deviation.

So I get two numbers, one representing the average label, and the other represents the uncertainty of the label.

Here is the code I use for doing this:


import numpy as np

def correlationWeightedAverageSigma(correlation_scores, label_values):
    # Convert lists to numpy arrays
    correlation_scores = np.array(correlation_scores)
    label_values = np.array(label_values)
    
    # Compute the exponential of each correlation score to use as weights
    exp_scores = np.exp(20*correlation_scores) # added factor of 20 to highly value large correlations over the others.
    
    # Normalize these weights to sum to 1
    weights = exp_scores / np.sum(exp_scores)
    
    # Calculate the weighted average (mean) of the label values
    weighted_average = np.dot(weights, label_values)
    
    # Calculate the weighted squared differences from the weighted mean
    weighted_squared_diffs = weights * (label_values - weighted_average) ** 2
    
    # Sum the weighted squared differences
    sum_weighted_squared_diffs = np.sum(weighted_squared_diffs)
    
    # Calculate the weighted standard deviation
    weighted_std_dev = np.sqrt(sum_weighted_squared_diffs)
    
    # Return both the weighted average and the weighted standard deviation
    return weighted_average, weighted_std_dev

Note that I am exponentially weighting my correlation values, and I have a factor of 20 in there as well. What happens in my case is I have so much data that many of my correlations are very close, and I want to accentuate the differences with this e^{20x} weighting.

You could generalize this to e^{kx} weighting, for k > 0, and tune k to fit your situation. Note k = 0 represents the standard mean.
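A quick way to see what k does to the weights (made-up scores):

import numpy as np

scores = np.array([0.80, 0.78, 0.55])

for k in (0, 5, 20):
    w = np.exp(k * scores)
    w /= w.sum()
    print(k, np.round(w, 3))

# k = 0  -> uniform weights, i.e. the standard mean
# k = 20 -> the 0.55 score contributes almost nothing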

Also, I do not want to worry about averaging negative correlation scores, so this solves that worry as well.

This approach works well for me in finding top correlations, and the average label for those correlations. But you could use it to find the average label, or sentiment, over any given time period, after you take the dot product of each embedding with your central mean vector.
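For example, a hypothetical call (made-up numbers, reusing the function above) would look like:

# correlations of the period's mean vector against four labeled reference
# embeddings, and the category integer of each of those references
scores = [0.81, 0.74, 0.42, 0.38]
labels = [1, 2, 3, 4]

avg_label, label_sigma = correlationWeightedAverageSigma(scores, labels)
print(f"label ~ {avg_label:.2f} +/- {label_sigma:.2f}")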

This is essentially a formal look at what @anon22939549 posted above, or at least my working version of it. It actually works, and in my case I use it to create embedding-based classifiers without requiring a fine-tuned model.

1 Like

Priceless!
