It looks like 'text-embedding-3' embeddings are truncated/rescaled versions of the higher-dimensional vectors

Started monkeying around a bit with my “playground” and 3-large…

== Sample from 3 d=1024 vectors returned ==
0 ['-0.048355922', '0.030448081', '-0.002223636', '-0.012373986']
1 ['0.000544521', '-0.052708060', '0.037970472', '0.024555754']
2 ['-0.018499656', '-0.023828303', '0.041195750', '-0.011031237']

== Sample from 3 d=3072 vectors returned ==
0 ['-0.035144586', '0.022129351', '-0.001616116', '-0.008993286']
1 ['0.000397846', '-0.038510315', '0.027742529', '0.017941277']
2 ['-0.013463791', '-0.017341908', '0.029981693', '-0.008028381']

== Cosine similarity and vector comparison of all inputs @ 3072 ==
0:"Jet pack" <==> 1:"tonal language":
0.0293138009386300 - identical: False
0:"Jet pack" <==> 2:"It's greased lightning!":
0.2810681077365715 - identical: False
1:"tonal language" <==> 2:"It's greased lightning!":
0.0498102101138038 - identical: False

So yes, the 1024-d values come straight from the start of the 3072-d vector: the signs match, and every magnitude differs by the same constant factor (~1.376), consistent with truncation followed by renormalization to unit length.
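
If that's right, the 1024-d vector should be reproducible by truncating the 3072-d vector and renormalizing. A minimal sketch of that check — the names emb_3072 and emb_1024 are hypothetical stand-ins for API output at each size:

import numpy as np

def truncate_and_renormalize(full_vec, dims: int) -> np.ndarray:
    """Keep the first `dims` values and rescale the result to unit L2 norm."""
    head = np.asarray(full_vec, dtype=np.double)[:dims]
    return head / np.linalg.norm(head)

# emb_3072 and emb_1024 are assumed to be float32 arrays from the API
# print(np.allclose(truncate_and_renormalize(emb_3072, 1024),
#                   emb_1024.astype(np.double), atol=1e-6))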

I wonder if they did any remapping of the highest-relevance dimensions toward the front, since in other embedding models much of the discriminative signal was found in the middle third of the dimensions.


A very different range of cosine similarity values than the ~0.8 thresholds typical of ada-002:

== Cosine similarity and vector comparison of all inputs @ 1536 ==
0:"Jet pack" <==> 1:"Jet pack":
1.0000000000000002 - identical: True
0:"Jet pack" <==> 2:"tonal language":
0.7175784894183529 - identical: False
0:"Jet pack" <==> 3:"It's greased lightning!":
0.7802673329570928 - identical: False
1:"Jet pack" <==> 2:"tonal language":
0.7175784894183529 - identical: False
1:"Jet pack" <==> 3:"It's greased lightning!":
0.7802673329570928 - identical: False
2:"tonal language" <==> 3:"It's greased lightning!":
0.7006591403724325 - identical: False


PS: the API's base64 payload is decoded to 32-bit single-precision floats, and the math is then done in doubles. The displayed sample values are vector elements 250-254, rounded from ~16 digits.
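
A minimal sketch of that decoding step, assuming the embeddings request was made with encoding_format="base64" and resp_b64 is a hypothetical string taken from the response:

import base64
import numpy as np

def decode_embedding(resp_b64: str) -> np.ndarray:
    """Decode a base64 embedding payload into 32-bit single-precision floats."""
    raw = base64.b64decode(resp_b64)
    return np.frombuffer(raw, dtype=np.float32)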


import numpy as np

def cosine_similarity(asingle, bsingle) -> np.double:
    """Return the normalized dot product (cosine similarity) of two vectors."""
    # promote the float32 API vectors to float64 before doing the math
    a = asingle.astype(np.double)
    b = bsingle.astype(np.double)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
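
Tying it together with the hypothetical decode_embedding() sketch above:

# emb_a = decode_embedding(resp_b64_a)   # hypothetical base64 payloads
# emb_b = decode_embedding(resp_b64_b)
# print(f"{cosine_similarity(emb_a, emb_b):.16f}")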