For simplicity, you could take the sum of all the vectors, then normalize it back to the unit hypersphere. This would essentially represent a new embedding vector that is the average embedding.
Let S = \sum_i v_i for the vectors v_i. Then V = \frac{S}{\|S\|}.
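As a minimal numpy sketch of that step (the mean_unit_vector name is just illustrative, and the embeddings are assumed to be equal-length numpy arrays):

import numpy as np

def mean_unit_vector(embeddings):
    # S = sum of the vectors v_i
    S = np.sum(np.asarray(embeddings), axis=0)
    # V = S / ||S||, projecting the sum back onto the unit hypersphere
    return S / np.linalg.norm(S)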
To figure out what this new vector represents, correlate it with all your previous labeled embedding vectors. Then take the average or top label as your answer.
If your categories are linearly related, like (0) Awesome, (1) Good, (2) OK, (3) Not Good (4) Terrible, you can average the category integers (or numbers) and then round to the nearest integer as your label for this new average vector.
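A rough sketch of that labeling step, assuming unit-norm embeddings stored as rows of a numpy array and integer category labels (nearest_category, labeled_embeddings, labels, and top_k are placeholder names, not anything standard):

import numpy as np

def nearest_category(mean_vector, labeled_embeddings, labels, top_k=10):
    # Dot the mean vector against every labeled embedding (cosine similarity on unit vectors)
    scores = np.asarray(labeled_embeddings) @ np.asarray(mean_vector)
    # Average the integer labels of the top-k most correlated examples, then round
    top_idx = np.argsort(scores)[-top_k:]
    avg = float(np.mean(np.asarray(labels)[top_idx]))
    return int(round(avg))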
What I do is take it a step further: I compute a correlation-weighted average, and also the correlation-weighted standard deviation.
So I get two numbers, one representing the average label, and the other representing the uncertainty of that label.
Here is the code I use for this:
import numpy as np

def correlationWeightedAverageSigma(correlation_scores, label_values):
    # Convert lists to numpy arrays
    correlation_scores = np.array(correlation_scores)
    label_values = np.array(label_values)
    # Compute the exponential of each correlation score to use as weights
    exp_scores = np.exp(20 * correlation_scores)  # factor of 20 added to strongly favor the largest correlations over the others
    # Normalize these weights to sum to 1
    weights = exp_scores / np.sum(exp_scores)
    # Calculate the weighted average (mean) of the label values
    weighted_average = np.dot(weights, label_values)
    # Calculate the weighted squared differences from the weighted mean
    weighted_squared_diffs = weights * (label_values - weighted_average) ** 2
    # Sum the weighted squared differences
    sum_weighted_squared_diffs = np.sum(weighted_squared_diffs)
    # Calculate the weighted standard deviation
    weighted_std_dev = np.sqrt(sum_weighted_squared_diffs)
    # Return both the weighted average and the weighted standard deviation
    return weighted_average, weighted_std_dev
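For context, a hypothetical call might look like this (labeled_embeddings, labels, and mean_vector are placeholder names for your own data, with everything unit-normalized so a plain dot product is the correlation):

# Correlate the new mean vector against each labeled embedding
correlation_scores = np.asarray(labeled_embeddings) @ mean_vector
label_values = labels  # e.g. integers 0..4 for Awesome..Terrible

avg_label, label_sigma = correlationWeightedAverageSigma(correlation_scores, label_values)
print(f"label ~ {avg_label:.2f} +/- {label_sigma:.2f}")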
Note that I am exponentially weighting my correlation values, with an extra factor of 20. In my case I have so much data that many of the correlations are very close together, and this e^{20x} weighting accentuates the differences between them.
You could generalize this to e^{kx} weighting for k ≥ 0 and tune k to fit your situation. Note that k = 0 reduces to the standard unweighted mean.
Also, because the exponential maps negative correlation scores to positive weights, I do not have to worry about averaging negative correlations either.
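If you want k as an explicit knob, here is a sketch of the same idea (correlationWeightedAverageSigmaK is just an illustrative name, not my production code):

import numpy as np

def correlationWeightedAverageSigmaK(correlation_scores, label_values, k=20.0):
    correlation_scores = np.array(correlation_scores)
    label_values = np.array(label_values)
    # Softmax-style weights e^{k x}, shifted by the max score for numerical stability
    # (the shift cancels out after normalization). k = 0 gives uniform weights, i.e. the
    # standard mean; larger k leans harder on the top correlations.
    exp_scores = np.exp(k * (correlation_scores - np.max(correlation_scores)))
    weights = exp_scores / np.sum(exp_scores)
    weighted_average = np.dot(weights, label_values)
    weighted_std_dev = np.sqrt(np.sum(weights * (label_values - weighted_average) ** 2))
    return weighted_average, weighted_std_dev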
This approach works well for me for finding the top correlations and the average label over those correlations. But you could also use it to find the average label, or sentiment, over any given time period, after you dot each of your labeled embeddings with your central mean vector.
This is essentially a formal look at what @anon22939549 posted above, or at least my working version of it. It actually works, and in my case I use it to build embedding-based classifiers without requiring a fine-tuned model.