Yes, it’s a shame they haven’t released much information on the internals, or access to the embedding model. Perhaps they will later, as they eventually did with GPT-2. It would be nice to know how the embeddings are created and to have access to the decoder part of the model to generate text from embeddings (though I suspect that poses a security risk?)
Like you, the lowest scores I found were ~0.6 (~52.5 degrees), and these were generally between texts from different domains (e.g. natural language vs code), which makes sense.
I did find lower scores using two strings found by @anon22939549 in this post: /t/fine-tuning-or-update-embedding-of-a-string/320955/9
He said he found them quickly but didn’t say how.
I don’t think it’s worth spending much time worrying about the dynamic range though. It’s really not an issue.
It’s also illusory IMO: it’s just not what people are used to from smaller models, and the familiar range is easily restored by renormalising the values (e.g. subtract 0.75 and rescale).
(As I mentioned before, that won’t restore the expected semantic ordering of “uncorrelated” and “anticorrelated”, if that’s needed; I don’t need it in my work for now.)
Simple example similarity function
import numpy as np

# 2023-10-20 Author: Gruff Davies
def similarity_transform(cosine_sim, min_deg=25, max_deg=37):
    """
    Dependencies: import numpy as np

    Transforms and renormalises cosine similarities between ada-002 embeddings
    to give more useful predictive values. Output ranges over (0, 1), not (-1, +1),
    since "anti-similar" is just not likely to ever occur in high-D representations.

    1. First converts the cosine similarity into an angular separation in degrees,
       which significantly helps "stretch" squashed results into a more useful range.
    2. Renormalises the range using the supplied min and max separations
       (experiment to find good values for your data).

    Returns:
        A similarity metric in the range (0, 1).
    """
    deg = np.degrees(np.arccos(cosine_sim))  # optional, but nice and intuitive
    similarity = (max_deg - np.clip(deg, min_deg, max_deg)) / (max_deg - min_deg)
    return similarity
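A quick usage example (the raw scores here are illustrative, not real ada-002 outputs):

raw_scores = np.array([0.61, 0.72, 0.85, 0.99])  # hypothetical ada-002 cosine similarities
print(similarity_transform(raw_scores))
# 0.61 and 0.72 are more than 37 degrees apart, so they clip to 0.0;
# 0.85 lands around 0.43; 0.99 saturates at 1.0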
More sophisticated approaches
One way to do that is to identify the feature dimensions of linguistic interest for one’s specific application and project onto a smaller space with only those bases (making sure to send the irrelevant dominant bases to null). Identifying the noise contributors seems fairly straightforward. Even a blunt “get rid of the top 1%” strategy seems to be effective according to the paper you shared, but I think a bit of analysis first would be prudent. A rough sketch of the idea follows below.
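Something along these lines (a minimal sketch, not the paper’s exact recipe; it assumes you already have a matrix of ada-002 embeddings and simply nulls out the top-k dominant directions found by SVD):

import numpy as np

def remove_dominant_directions(embeddings, k=3):
    # embeddings: (n_texts, 1536) array of ada-002 vectors
    # k: number of dominant shared directions to null out (tune for your data;
    #    "top 1%" of 1536 dims would be ~15)
    _, _, vt = np.linalg.svd(embeddings, full_matrices=False)
    top = vt[:k]                                     # (k, 1536) dominant bases
    cleaned = embeddings - embeddings @ top.T @ top  # project onto their null space
    # Re-normalise so cosine similarity is well-defined afterwards
    return cleaned / np.linalg.norm(cleaned, axis=1, keepdims=True)

Whether to centre on the mean first is the same mean-subtraction question I get to below.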
I would caution, though, that expecting -1, 0 and +1 for cosine similarities is really a heuristic learned from the early days of word2vec, and it has aged badly now that we have high-D embeddings. Even if you get a clean, large set of semantically meaningful feature bases, you would still need every single feature reversed to get a strongly negative comparison. With several hundred features, that just isn’t going to happen. Almost all texts have so much in common that they’re bound to produce high similarity scores.
Where specific features are of interest (e.g. sentiment), extract those features (create a transform that maps onto a small feature set) and then compute cosine similarity in that smaller space, as in the sketch below.
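For instance (a sketch only; the feature_directions matrix is hypothetical and you would have to construct it yourself, e.g. from embeddings of contrasting exemplar texts):

import numpy as np

def feature_similarity(emb_a, emb_b, feature_directions):
    # feature_directions: (n_features, 1536) matrix, one (unit) vector per
    # linguistic feature of interest -- hypothetical, supplied by you
    a = feature_directions @ emb_a   # project down to a small feature vector
    b = feature_directions @ emb_b
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))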
Personally, I wouldn’t bother calculating average vectors and subtracting them, as that depends too heavily on computing the mean from a good, representative set. It is a reasonable idea and seems to work, but it might introduce unexpected bias and fail on certain outliers.
Gruff
Oh, one final thought: I get why you might be perplexed about the embeddings “living in a tiny cone” but I would actually question that.
The embeddings seem to be in a hypercone with an apex angle of ~60 degrees in 1536 dimensions. That’s unimaginably vast. I’ve only tested small texts; I don’t know how large texts of ~8k tokens embed. Have you, or anyone reading this thread, done tests on big texts? Maybe they extend further.
As a fraction of the entire embedding space it seems small, but that space is so unimaginably vast in the first place that this is almost inevitable. The unit hypersphere itself is tiny compared to the volume of the embedding space (hypersphere volumes actually tend to 0% of the unit hypercube’s volume in the limit of high dimensions).
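To make that concrete, here’s a quick sanity check using the standard n-ball volume formula (scipy’s log-gamma keeps the numbers from overflowing):

import numpy as np
from scipy.special import gammaln

def ball_to_cube_volume_ratio(n):
    # Volume of the unit n-ball divided by its enclosing hypercube [-1, 1]^n:
    # (pi^(n/2) / Gamma(n/2 + 1)) / 2^n, computed in log space for stability
    log_ratio = (n / 2) * np.log(np.pi) - gammaln(n / 2 + 1) - n * np.log(2)
    return np.exp(log_ratio)

print(ball_to_cube_volume_ratio(3))     # ~0.52: the familiar sphere-in-a-cube
print(ball_to_cube_volume_ratio(20))    # ~2.5e-8: already negligible
print(ball_to_cube_volume_ratio(1536))  # underflows to 0.0: effectively nothing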
It’s only an issue if the anisotropy impacts performance, which I don’t think it does. It feels weird from a 3D perspective, but it’s hard to develop good intuitions about very high D spaces. Things get quite weird.