It looks like 'text-embedding-3' embeddings are truncated/rescaled versions of the higher-dimensional vectors

Started monkeying around a bit with my “playground” and 3-large…

== Sample from 3 d=1024 vectors returned ==
0 ['-0.048355922', '0.030448081', '-0.002223636', '-0.012373986']
1 ['0.000544521', '-0.052708060', '0.037970472', '0.024555754']
2 ['-0.018499656', '-0.023828303', '0.041195750', '-0.011031237']

== Sample from 3 d=3072 vectors returned ==
0 ['-0.035144586', '0.022129351', '-0.001616116', '-0.008993286']
1 ['0.000397846', '-0.038510315', '0.027742529', '0.017941277']
2 ['-0.013463791', '-0.017341908', '0.029981693', '-0.008028381']

== Cosine similarity and vector comparison of all inputs @ 3072 ==
0:"Jet pack" <==> 1:"tonal language":
0.0293138009386300 - identical: False
0:"Jet pack" <==> 2:"It's greased lightning!":
0.2810681077365715 - identical: False
1:"tonal language" <==> 2:"It's greased lightning!":
0.0498102101138038 - identical: False

So yes, the 1024-d values come straight from the start of the 3072-d vector: the signs match, and every magnitude differs by the same constant factor (~1.376), consistent with truncation followed by renormalization to unit length.
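
If that's right, the 1024-d vector should be reproducible by truncating the 3072-d vector and renormalizing. A minimal sketch of that check — the names emb_3072 and emb_1024 are hypothetical stand-ins for API output at each size:

import numpy as np

def truncate_and_renormalize(full_vec, dims: int) -> np.ndarray:
    """Keep the first `dims` values and rescale the result to unit L2 norm."""
    head = np.asarray(full_vec, dtype=np.double)[:dims]
    return head / np.linalg.norm(head)

# emb_3072 and emb_1024 are assumed to be float32 arrays from the API
# print(np.allclose(truncate_and_renormalize(emb_3072, 1024),
#                   emb_1024.astype(np.double), atol=1e-6))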

I wonder if they did any remapping of the highest-relevance dimensions toward the front, since in other embedding models much of the discriminative signal was found in the middle third of the dimensions.


A very different range of cosine similarity values than the ~0.8 thresholds typical of ada-002:

== Cosine similarity and vector comparison of all inputs @ 1536 ==
0:"Jet pack" <==> 1:"Jet pack":
1.0000000000000002 - identical: True
0:"Jet pack" <==> 2:"tonal language":
0.7175784894183529 - identical: False
0:"Jet pack" <==> 3:"It's greased lightning!":
0.7802673329570928 - identical: False
1:"Jet pack" <==> 2:"tonal language":
0.7175784894183529 - identical: False
1:"Jet pack" <==> 3:"It's greased lightning!":
0.7802673329570928 - identical: False
2:"tonal language" <==> 3:"It's greased lightning!":
0.7006591403724325 - identical: False


PS: the API's base64 payload is decoded to 32-bit single-precision floats, and the math is then done in doubles. The displayed sample values are vector elements 250-254, rounded from ~16 digits.
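
A minimal sketch of that decoding step, assuming the embeddings request was made with encoding_format="base64" and resp_b64 is a hypothetical string taken from the response:

import base64
import numpy as np

def decode_embedding(resp_b64: str) -> np.ndarray:
    """Decode a base64 embedding payload into 32-bit single-precision floats."""
    raw = base64.b64decode(resp_b64)
    return np.frombuffer(raw, dtype=np.float32)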


import numpy as np

def cosine_similarity(asingle, bsingle) -> np.double:
    """Return the normalized dot product (cosine similarity) of two vectors."""
    # promote the float32 API vectors to float64 before doing the math
    a = asingle.astype(np.double)
    b = bsingle.astype(np.double)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
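
Tying it together with the hypothetical decode_embedding() sketch above:

# emb_a = decode_embedding(resp_b64_a)   # hypothetical base64 payloads
# emb_b = decode_embedding(resp_b64_b)
# print(f"{cosine_similarity(emb_a, emb_b):.16f}")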