The range isn’t a problem, it’s just not what people are used to. You can normalise to the range that your data typically spans (as you’ve done). Other solutions are more involved, such as removing some of the top vector components that are common to all embeddings (which don’t seem to code for anything semantically useful, but are useful to the model). For me, normalising to the cosine range 0.8 to 0.9 seems to be enough, and it can be useful to invert this back to degrees too, to get a better sense of similarity.
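As an illustration of that kind of normalisation and the conversion back to degrees, here is a minimal sketch (the 0.70 and 0.95 bounds are just placeholders; measure the band your own data actually occupies):

import numpy as np

def cosine_to_degrees(cos_sim):
    # Convert a raw cosine similarity back into an angle in degrees
    return np.degrees(np.arccos(np.clip(cos_sim, -1.0, 1.0)))

def rescale_similarity(cos_sim, lo=0.70, hi=0.95):
    # Map the band your data typically spans (lo..hi) onto 0..1 for easier thresholding
    return (np.clip(cos_sim, lo, hi) - lo) / (hi - lo)

print(cosine_to_degrees(0.9))    # ≈ 25.8 degrees
print(rescale_similarity(0.9))   # 0.8 with the placeholder bounds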
Can you show your work for this?
First, I assume you mean the spherical cap when you write “cone cap”?
Also, intuitively, the points comprising a spherical cap are a subset of the points comprising the sphere, so in no dimension does a spherical cap exceed the surface area of its containing sphere.
For reference, the formula for the area of a hyperspherical cap is

S_{cap} = \frac{1}{2} S_{tot} \cdot I_{\sin^2(\phi)}\left(\textstyle{\frac{d-1}{2}, \frac{1}{2}}\right)

for \phi \le \frac{\pi}{2}, where S_{tot} is the total surface area of the hypersphere and I_x(a, b) is the regularised incomplete beta function.
I’m also unclear what you mean when you write that below 30^{\circ} the whole “cone cap” is constrained to a single orthant[1]. It certainly could be so constrained, but in practical terms that depends entirely on the orientation of the cone, and I fail to see the relevance here, as the full hypersphere sweeps through all orthants.
I am curious where you got this number, because the maximum cone angle is a function of dimension. To the best of my post-it note calculations[2], the maximal cone angle that can be contained in a single orthant (somewhat shockingly) simplifies to

\theta_{\max} = \arcsin\left(\frac{1}{\sqrt{d}}\right)
So, the widest cone angle[3] that can be contained within a single orthant, for a selection of dimensions, looks like:
| Dimension | Radians | Degrees |
|---|---|---|
| 2 | 0.785 | 45.0 |
| 3 | 0.615 | 35.3 |
| 4 | 0.524 | 30.0 |
| 5 | 0.464 | 26.6 |
| 10 | 0.322 | 18.4 |
| 100 | 0.100 | 5.7 |
| 1000 | 0.032 | 1.8 |
| 10000 | 0.010 | 0.6 |
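As a quick cross-check of that table (a small sketch; it just evaluates the arcsin expression above for each dimension):

import numpy as np

for d in [2, 3, 4, 5, 10, 100, 1000, 10000]:
    theta = np.arcsin(1.0 / np.sqrt(d))   # widest half-angle that stays inside one orthant
    print(f"{d:>5}  {theta:.3f} rad  {np.degrees(theta):.1f} deg")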
But, again, I don’t really understand what the fixation is with being constrained to a single orthant. Unless you think that rescaling all of your embeddings to lie \in [0, 1]^{1536} will let you use unsigned doubles to store them, which might impart a small performance boost.
Maybe (most likely) I simply failed to understand what you were trying to convey here.
Unless you’re suggesting some sort of rescaling so that all vector elements x_{i}\in [0, 1] ↩︎
I can put together the maths on this later if anyone is interested. ↩︎
Where we are taking the cone angle here to be between the normal of the base and the slant of the cone; double these values for the angle between the two most distant points on the cone base. ↩︎
Good timing, I’ve actually been wondering whether my calculations were off or not, so it’s good to have someone check. But, firstly, I do want to reiterate that I don’t think there’s much practical value to the result at all; I was just curious to calculate it to understand the counter-intuitive nature of high-dimensional hyperspheres and cone caps (and yes: spherical cone cap is what I meant).
It was really a response to someone else’s comment that the apparent dynamic range of cosine sims is very small (which it is when considering two points, as that’s a 2D angle), whereas the hypercone of all points within a cone of the same apex angle is, relatively speaking, huge in higher dimensions. (I wasn’t claiming the absolute area explodes. I thought I’d explicitly said it goes to zero. Sorry if that wasn’t clear though.)
Anyway, I implicitly assumed a cone that was centred on an orthant’s central radial vector (so it was symmetric and would fit inside the orthant if small enough).
The value of 30 degrees I actually got by experimenting and calculating the ratio of the cone-cap area to the area of one orthant (if the ratio is less than one, the cap presumably fits entirely inside; if it’s more, the cap must sweep a little outside into the neighbouring orthants, which is why the ratio can be bigger than one orthant - I think, anyway).
I used the same formula as you, I believe. Mine were:
Estimating the surface area of a conic section of a hypersphere
NB: Requires the gamma function, \Gamma(x), and the regularised incomplete beta function, I_x(a,b).

S_{tot} = A_d(r) = \frac{2 \pi^{\frac{d}{2}}}{\Gamma(\frac{d}{2})} \cdot r^{d-1}

S_{cap} = \frac{1}{2} S_{tot} \cdot I_{\sin^2(\theta)}\left(\textstyle{\frac{d-1}{2}, \frac{1}{2}}\right)
and then a single orthant’s area is just S_{tot} divided by the number of orthants, 2^d.
That’s a fascinating and shocking result, if true. How did you derive that? It doesn’t agree with my experimental results, but, I may have made a mistake in my calculations.
I did start doing some sanity checking of my results in lower dimensions, which made me question whether the formulae work in 2D and 3D or whether I’d got something wrong, but since this doesn’t have any practical value, I haven’t had time to look more deeply.
If you want to double check, though, this was the Python implementation:
import numpy as np
from scipy.special import gamma, betainc


def validate_hypersphere_function_params_d_r(d, r, theta=None, check_theta=False):
    # Basic sanity checks on the dimension, radius and (optionally) the cap angle
    if d < 2 or r <= 0:
        raise ValueError(f"Invalid parameter values passed, d: {d} and r: {r}.\n"
                         " d must be an integer >= 2 and r must be > 0.")
    if check_theta and (theta <= 0 or theta >= np.pi / 2):
        raise ValueError(f"Invalid parameter value passed, theta: {theta}.\n"
                         " theta must be in radians and in the range 0 to π/2.")
    return


# Function to calculate the total surface area of a d-dimensional hypersphere
def hypersphere_surface_area(d, r):
    r"""Total area is well known to be:
    $ S_{tot} = A_d(r) = \frac{2 \pi^{\frac{d}{2}}}{\Gamma(\frac{d}{2})} \cdot r^{d-1} $

    Note: gamma(d/2) overflows float64 for large d; switch to mpmath
    (mp.gamma with a high mp.dps) if you need d in the hundreds or more.
    """
    validate_hypersphere_function_params_d_r(d, r)
    return (2 * (np.pi ** (d / 2)) / gamma(d / 2)) * (r ** (d - 1))


def hypersphere_cap_area(d, r, theta):
    """
    Calculate the surface area of a cap on a hypersphere of dimension d.

    Parameters:
        d (int): The dimension of the hypersphere.
        r (float): The radius of the hypersphere.
        theta (float): The angle in radians subtended by the cap at the centre of the hypersphere.

    Returns:
        float: The surface area of the cap.
    """
    # Ensure valid input
    validate_hypersphere_function_params_d_r(d, r, theta=theta, check_theta=True)
    # Total surface area of the hypersphere
    S_tot = hypersphere_surface_area(d, r)
    # Argument x of the regularised incomplete beta function
    x = np.sin(theta) ** 2
    # Regularised incomplete beta function I_x((d - 1) / 2, 1 / 2)
    I_x = betainc((d - 1) / 2, 0.5, x)
    # Surface area of the cap
    S_cap = 0.5 * S_tot * I_x
    return S_cap
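As a couple of quick checks with the functions above (the dimensions and angles here are just illustrative, and the ratio check uses a modest d because gamma() overflows float64 well before d = 1536):

theta = np.radians(40)                       # arbitrary test angle below 90 degrees
print(hypersphere_cap_area(3, 1.0, theta))   # d = 3 via the general formula
print(2 * np.pi * (1 - np.cos(theta)))       # classical 3D spherical-cap area, should match

# The cap-vs-orthant ratio heuristic described earlier, at an illustrative dimension
d, r, theta = 50, 1.0, np.radians(30)
orthant_area = hypersphere_surface_area(d, r) / 2**d   # one of the 2^d equal orthants
print(hypersphere_cap_area(d, r, theta) / orthant_area)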
What’s odd also is that the volumes and areas of hyperspheres go to zero in higher dimensions for fixed radius R (ref)
As for orthants, there are 2^{1536} orthants in ada-002’s embedding space. This is beyond astronomical, and more than the total number of atoms in the universe, and these are just orthants, or chunks of the space that partition the whole space, not isolated points in the space.
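For a rough sense of scale (a throwaway calculation; the commonly quoted figure for atoms in the observable universe is on the order of 10^{80}):

import math
print(1536 * math.log10(2))   # ≈ 462.4, so 2**1536 ≈ 10**462
print(len(str(2**1536)))      # 463 decimal digits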
So the space is vaster than our known universe!
Yes, I couldn’t wrap my head around the “goes to zero” fact when I first read it. Even less so the fact that volumes and areas first grow to a maximum (at ~5D and ~7D respectively) and then start to fall off rapidly to zero.
In fact, this is due just to the way we define volume and units, which are always based on the unit square and the unit cube. Since those are n^2 and n^3 and fill the entire space from (0,1), they grow as such.
Unit spheres, as defined, are more like “two-unit spheres” (their radius is 1, e.g. in 2D they cover most of the four unit squares in the four quadrants). However, they don’t cover a fraction of the space in the most distant corner of each, and that’s the key to understanding the mystery.
The distance between opposite corners of an n-cube grows as \sqrt{n}, which is unbounded! The distance any sphere in the same space centred on the origin will reach towards that far corner is always 1, by definition.
By 4D, you can fit an entire new ball in the gaps between half-unit balls placed inside the orthants such that they’re “kissing” (a face-centred cubic arrangement), which is hard to visualise since in 2D and 3D the centre balls are much smaller than the surrounding balls.
So it’s only relative to the (arbitrary) choice of using the unit n-cube as the definition of area and volume that the n-ball’s area and volume go to zero. Their volumes and areas continue to explode exponentially too, just way slower than the unit cube does.
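To make the shrinking-share point concrete, here is a small sketch using the standard unit-ball volume V_d = \pi^{d/2} / \Gamma(d/2 + 1) (the dimensions printed are arbitrary); it shows the ball volume peaking around d = 5 and its share of the enclosing [-1, 1]^d cube collapsing towards zero:

import numpy as np
from scipy.special import gamma

for d in [1, 2, 3, 4, 5, 6, 7, 10, 20, 50]:
    v_ball = np.pi ** (d / 2) / gamma(d / 2 + 1)   # volume of the unit-radius d-ball
    share = v_ball / 2 ** d                        # fraction of the enclosing [-1, 1]^d cube
    print(f"d={d:>2}  V_ball={v_ball:8.4f}  share_of_cube={share:.2e}")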
Summary created by AI.
In this long and rich thread, users are discussing and troubleshooting the usage of OpenAI’s text-embedding-ada-002 model. In post 1, vanehe08 starts the thread with specific issues encountered when determining semantic similarity between two different sentences using the model: they report getting unexpectedly high similarity scores between semantically different sentences.
curt.kennedy in post 2 observes that, in his experience, the engine doesn’t seem to distribute angular breadth widely, and presents his concerns around this. In post 3, ruby_coder adds on, sharing their own experience and observations of the model, noting that the texts’ vectors are produced by an OpenAI model and not by direct analysis of the text.
Post 4 follows up on ruby_coder’s observations, offering the perspective that while cosine-similarity values fall in a range of about 0.7 to 1, the expected range would be -1 to +1. ruby_coder again in post 5 expresses disappointment over the lack of dynamic range.
ruby_coder goes on to share an extensive list of experiments with the cosine similarity scores of several phrases in posts 6, 7, and 10. They suggest that Euclidean distance may be a useful alternative to the dot product and cosine similarity functions for comparing OpenAI embedding vectors.
curt.kennedy in post 8 suggests a theory that OpenAI might be reserving some of the vector space for other uses, and that the traditional mathematical meaning of similarity might not translate correctly in this embedding space. In post 11, debreuil theorizes that negative spikes in embeddings might be positional encoding, which could result in words with the same semantic meaning having different embeddings due to their positions.
Later in the thread, curt.kennedy in post 24 shares code to perform PCA on embeddings to render them isotropic. raymonddavey in post 36 suggests that the model might be encoding the frequency of words as well as semantic meaning. Posts 37 and 42 by curt.kennedy share code and further expound on the importance of reorienting and scaling new vectors in the embedding space discovered through PCA.
In post 63, curt.kennedy stresses the importance of prefixing everything to be embedded with a space, to expose the string to the overall context and meaning in the embeddings, acknowledging that words that start a document introduce a rarity, causing a shift in embeddings.
Overall, users explored and debated various aspects of the model related to their use cases: they compared the results of different similarity functions, talked about working across multiple languages, discussed the presence of bias in the embeddings, and suggested workarounds or adjustments to the model.
Summarized with AI on Nov 24 2023
AI used: gpt-4-32k