Here’s a dead simple experiment you can do right now assuming you have at least one thumb and two index fingers.
Let your thumb be your reference vector. Your two index fingers will represent different embeddings.
Arrange them however you like, such that both index-finger vectors emanate from the base of your thumb.
The angle between each index finger and the thumb represents the cosine similarity between that embedded vector and your reference point.
Now, note that you can rotate your free index finger around the axis of the thumb. This doesn’t change the cosine similarity between the free index finger and the thumb, but it can have a tremendous effect on the cosine similarity between the two index fingers.
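If you’d rather poke at this with code than with your hands, here’s a minimal NumPy sketch of the same experiment; the specific vectors and rotation angles are arbitrary values I made up for illustration.

```python
import numpy as np

def cos_sim(a, b):
    """Cosine similarity between two vectors."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

thumb = np.array([0.0, 0.0, 1.0])      # reference vector, our z-axis
fixed = np.array([0.6, 0.0, 0.8])      # one index finger, held still

# Rotate the "free" index finger around the thumb's axis. Its tilt from
# the thumb never changes, but its angle to the fixed finger does.
for theta in np.linspace(0.0, np.pi, 5):
    rot_z = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                      [np.sin(theta),  np.cos(theta), 0.0],
                      [0.0,            0.0,           1.0]])
    free = rot_z @ np.array([0.6, 0.0, 0.8])
    print(f"free vs thumb: {cos_sim(free, thumb):.3f}   "
          f"free vs fixed: {cos_sim(free, fixed):.3f}")
```

The first column stays pinned at 0.800 while the second slides from 1.000 down to 0.280, which is the whole point of the experiment.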
Also note, cosine similarity is not a perfect measure of semantic similarity and certainly not a perfect proxy for sentiment agreement.
So, staying in three dimensions for a bit, we’ve established that cosine similarity doesn’t care where in the space you’re pointing, just the angular distance from the frame of reference. Let’s envision the space now as a ball, with our reference frame (thumb) pointing to the North Pole of our ball.
An embedding with an arbitrary cosine similarity is no different from any other with that same cosine similarity (with respect to this reference frame). So we can abstract this one vector a bit into the ring on the surface of the ball formed by rotating the vector around our z-axis/reference vector/thumb.
Continuing on, with respect to this one frame of reference, every piece of text you embed will trace out its own ring of cosine similarity on this reference ball.
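To make the ring picture concrete, here’s a sketch that reduces a batch of embeddings to the single number describing each ring: the polar angle arccos(similarity). The random vectors below are stand-ins, not real model output.

```python
import numpy as np

rng = np.random.default_rng(0)
reference = rng.normal(size=3)                        # our "thumb"
reference /= np.linalg.norm(reference)

texts = rng.normal(size=(5, 3))                       # pretend embeddings
texts /= np.linalg.norm(texts, axis=1, keepdims=True)

sims = texts @ reference                              # cosine similarities
angles = np.degrees(np.arccos(sims))                  # angular distance from the pole
for s, a in zip(sims, angles):
    print(f"cos sim {s:+.3f}  ->  ring at {a:5.1f} deg from the pole")
```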
It’s not obvious what a simple (or even weighted) mean of the angular distances of these rings from the pole should represent, especially before we’ve rigorously established that cosine similarity is a good proxy for sentiment alignment.
I agree that intuitively it at least seems like a good starting point for experimentation, but I’d be hard-pressed to justify it mathematically without putting quite a bit more thought into it first.
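One concrete way to see why the mean is slippery: averaging cosine similarities is not the same as averaging the underlying angles, because arccos is nonlinear. The two similarity values below are invented for illustration.

```python
import numpy as np

sims = np.array([0.95, 0.10])   # one ring near the pole, one far from it

mean_sim = sims.mean()
mean_angle = np.degrees(np.arccos(sims).mean())

print(f"mean of similarities:      {mean_sim:.3f}")
print(f"angle of that mean sim:    {np.degrees(np.arccos(mean_sim)):.1f} deg")
print(f"mean of the angles:        {mean_angle:.1f} deg")
```

Here the angle implied by the mean similarity (about 58 degrees) and the mean of the angles (about 51 degrees) disagree by several degrees, and neither obviously deserves to be called “the average sentiment.”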
Also note, so far we’ve been thinking and discussing in three dimensions. When we move to 1,500 or 3,000 dimensions, things get a whole lot trickier; this is where much of my hesitation about the interpretability of the average cosine similarity comes from.
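Part of what makes high dimensions tricky is concentration of measure: cosine similarities between random directions bunch tightly around zero, with spread shrinking like 1/sqrt(d). A quick simulation with random vectors (again, stand-ins for real embeddings) shows the effect at the dimensionalities mentioned above.

```python
import numpy as np

rng = np.random.default_rng(1)
for d in (3, 1500, 3000):
    v = rng.normal(size=(2000, d))
    v /= np.linalg.norm(v, axis=1, keepdims=True)   # random unit vectors
    sims = v[1:] @ v[0]                             # similarities to one reference
    print(f"d={d:5d}: mean sim {sims.mean():+.4f}, std {sims.std():.4f}")
```

At d=3 the spread is wide, but at d=3000 nearly everything lands within a few hundredths of zero, so 3-D intuition about what a given gap in similarity means doesn’t transfer cleanly.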
It’s just not clear what it means, in terms of sentiment, for one bit of text to have a cosine similarity of 0.72 with a reference embedding today and another bit of text to have a similarity of 0.75 with that same reference tomorrow, especially because we have no way to ensure that the measured cosine similarity captures only the sentiment represented by the reference vector.
For instance, perhaps 80% of the cosine similarity represents sentiment alignment with the reference and the other 20% is, say, the structural format of the text.
Any perceived trend in sentiment could be masked by the “noise” of the structure.
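As a purely synthetic toy model of that 80/20 worry: below, a weak upward sentiment trend is mixed with an untrended structural component whose day-to-day variation is about the same size as the entire trend. Every number here is made up.

```python
import numpy as np

rng = np.random.default_rng(2)
days = np.arange(30)
sentiment = 0.60 + 0.001 * days              # slow "true" drift in alignment
structure = rng.normal(0.70, 0.15, size=30)  # formatting component, no trend
measured = 0.8 * sentiment + 0.2 * structure # what the similarity would report

print(f"sentiment-driven rise over the window: {0.8 * 0.001 * 29:.4f}")
print(f"std of the structural term:            {np.std(0.2 * structure):.4f}")
```

With numbers like these, the structural jitter (about 0.03 per day) is larger than the total sentiment movement (about 0.023 over a month), so the trend is invisible in any short window.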
There may be ways to mitigate this (multiple different references for the same sentiment spring to mind), but without a clear idea of what you need to control for, this is a difficult task at best.
Huge amounts of data might smooth out some of the noise, at least enough to suggest a trend of some sort.
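For what it’s worth, here’s a hedged sketch of the multiple-references idea: score a text by its mean cosine similarity against several differently phrased references for the same sentiment, hoping their structural components partially cancel. The function name is my own, and the random vectors are stand-ins for real embeddings.

```python
import numpy as np

def sentiment_score(text_vec: np.ndarray, reference_vecs: np.ndarray) -> float:
    """Mean cosine similarity between one embedding and several
    reference embeddings that express the same sentiment."""
    v = text_vec / np.linalg.norm(text_vec)
    refs = reference_vecs / np.linalg.norm(reference_vecs, axis=1, keepdims=True)
    return float((refs @ v).mean())

# Toy usage with random stand-ins for four rephrasings of one sentiment:
rng = np.random.default_rng(3)
refs = rng.normal(size=(4, 8))
text = rng.normal(size=8)
print(sentiment_score(text, refs))
```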
Having written all that, you should feel free to go ahead and do what you’re doing anyway, but continue to experiment and look for any new literature that addresses these issues. If the results you get doing process X are better than the results you get without doing process X, it would be foolish not to do X, even if you can’t rigorously explain how X works. Just keep an eye on it, be skeptical, and be willing to change course if X stops working.