Splitting text into chunks versus reducing the text

Determinism means you always get the same output from the same input. The way you expect computer code to work, basically.

The current embedding models are not deterministic: they do not produce the same output for the same input.


I performed 10 embedding runs on the 3-small model with the same 600 tokens of text, and a tensor comparison found three unique results:


np.unique(em_ndarray, axis=0)
array([[ 0.01526077,  0.01427238,  0.06942467, ...,  0.00670789,
        -0.01064828, -0.01102387],
       [ 0.01529922,  0.01424501,  0.06947242, ...,  0.00671729,
        -0.01068046, -0.01100331],
       [ 0.01530029,  0.01423283,  0.06947729, ...,  0.00672765,
        -0.01066144, -0.01100408]], dtype=float32)

Similarity between embeddings 4 and 5: 1.0000000000
Similarity between embeddings 4 and 6: 0.9999994040
Similarity between embeddings 4 and 7: 1.0000000000
Similarity between embeddings 4 and 8: 0.9999997616
Similarity between embeddings 4 and 9: 0.9999997616…

With 3-large and its 3072 dimensions, running 20 trials, none of the embeddings were the same.

Number of unique embeddings: 20
Similarity between embeddings 0 and 1: 0.9995572567
Similarity between embeddings 0 and 2: 0.9999965429
Similarity between embeddings 0 and 3: 0.9995989203
Similarity between embeddings 0 and 4: 0.9999960661
Similarity between embeddings 0 and 5: 0.9994193912
Similarity between embeddings 0 and 6: 0.9995987415
Similarity between embeddings 0 and 7: 0.9994224906
Similarity between embeddings 0 and 8: 0.9996060133
Similarity between embeddings 0 and 9: 0.9999982715
Similarity between embeddings 0 and 10: 0.9996063113
Similarity between embeddings 0 and 11: 0.9999966025
Similarity between embeddings 0 and 12: 0.9999939799
Similarity between embeddings 0 and 13: 0.9999993443
Similarity between embeddings 0 and 14: 0.9999982119
Similarity between embeddings 0 and 15: 0.9999961257
Similarity between embeddings 0 and 16: 0.9999954700
Similarity between embeddings 0 and 17: 0.9993475080
Similarity between embeddings 0 and 18: 0.9994521737
Similarity between embeddings 0 and 19: 0.9993476272
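
For reference, here is a minimal sketch of how a comparison like the ones above can be run. It assumes the openai Python SDK (v1.x) and numpy, and it is not the exact script that produced the numbers shown; the input text is only a stand-in.

import numpy as np
from openai import OpenAI

client = OpenAI()
text = "some fixed passage of roughly 600 tokens " * 50  # stand-in input
runs = 10

vectors = []
for _ in range(runs):
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    vectors.append(resp.data[0].embedding)

em_ndarray = np.array(vectors, dtype=np.float32)

# Count how many of the returned vectors are bit-identical
print("Number of unique embeddings:", len(np.unique(em_ndarray, axis=0)))

# Cosine similarity of every run against the first one
def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

for i in range(1, runs):
    print(f"Similarity between embeddings 0 and {i}: {cosine(em_ndarray[0], em_ndarray[i]):.10f}")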

So previous complaints were not addressed; this behavior now appears across all OpenAI embedding models. The results are close enough for practically every purpose except determining whether two inputs are identical or whether an embedding already exists.

As far as “format” goes, I use

"encoding_format": "base64"

and then decode the base64 and load the 32-bit vectors into floats without loss.
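
Here is a sketch of that decoding step, assuming a raw REST call with the requests library and numpy (the model and input are just placeholders):

import base64
import os

import numpy as np
import requests

resp = requests.post(
    "https://api.openai.com/v1/embeddings",
    headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
    json={
        "model": "text-embedding-3-small",
        "input": "example text",
        "encoding_format": "base64",
    },
)
resp.raise_for_status()

# With encoding_format "base64", the embedding arrives as a base64 string
# of packed little-endian 32-bit floats instead of a JSON list of numbers.
raw = base64.b64decode(resp.json()["data"][0]["embedding"])
vector = np.frombuffer(raw, dtype="<f4")  # float32, no precision loss
print(vector.shape, vector[:4])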


ada-002 and the v3 embedding models have an 8192-token context length.

You will naturally have to split anything larger than that. What you actually do depends on the application and the destination of the large input.
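
If the splitting is purely token-based, a sketch along these lines works. It assumes the tiktoken library; the chunk size and overlap below are arbitrary illustrative values, not recommendations.

import tiktoken

def chunk_by_tokens(text: str, max_tokens: int = 8000, overlap: int = 200) -> list[str]:
    # ada-002 and the v3 embedding models all use the cl100k_base tokenizer
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks = []
    start = 0
    while start < len(tokens):
        end = start + max_tokens
        chunks.append(enc.decode(tokens[start:end]))
        if end >= len(tokens):
            break
        start = end - overlap  # overlap so a thought isn't cut exactly at a boundary
    return chunks

# Example: each returned chunk now fits within the model's context window
# chunks = chunk_by_tokens(long_text)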
