Splitting text into chunks versus reducing the text

Hi there! :wave:

I am trying to retrieve embeddings using “text-embedding-ada-002” for a text that uses 13,778 tokens, which is above the token limit for ada.

If I do split my text into chunks and use embedding_format (seen in EliahKagan’s repo) to make the vectors deterministic, how will sending each part individually affect the “context”, versus just making my text smaller?

Also, what does it mean for ChatGPT to encode embeddings deterministically? I know that it will produce the same vector for the same text; however, does that imply that format_embedding keeps all those embeddings under one “conversation”, like the chatbot, so that I can access the contextual aspect of this?

Let me know if I need to rephrase any of my questions!

Thank you!!!

Welcome to the community!

I don’t think I fully understand all of your questions, but let me add my 2c where I can:

  1. It should be noted that while ada can theoretically support an 8k-token context window, the reality is that only the top part of the prompt will be salient; you’d get maybe a paragraph or two in there, and everything else will be more or less irrelevant to the embedding vector. The text-embedding-3 models do a better job at handling bigger texts.

  2. Not sure what you mean by “format_embedding”, but there’s no such thing as a conversation here; everything you send to the embedding endpoint is evaluated independently. I’d always expect there to be some rounding errors or noise due to spacing or whatever. I don’t think it’s a good idea to expect, or operate under the assumption of, determinism with LLMs, even if it’s theoretically possible.

In my limited experience, the best thing you can do is to try to embed one concept at a time. If a whole chapter talks about the same thing, that’s great, but if two paragraphs talk about two completely different things, then I would try not to embed them as one chunk.

Determinism means you always get the same output from the same input. The way you expect computer code to work, basically.

The current embeddings models are not deterministic. They do not produce the same output for the same input.


I performed 10 embedding runs on the 3-small model with the same 600 tokens of text, and got three unique results when comparing the tensors:


np.unique(em_ndarray, axis=0)
array([[ 0.01526077,  0.01427238,  0.06942467, ...,  0.00670789,
        -0.01064828, -0.01102387],
       [ 0.01529922,  0.01424501,  0.06947242, ...,  0.00671729,
        -0.01068046, -0.01100331],
       [ 0.01530029,  0.01423283,  0.06947729, ...,  0.00672765,
        -0.01066144, -0.01100408]], dtype=float32)

Similarity between embeddings 4 and 5: 1.0000000000
Similarity between embeddings 4 and 6: 0.9999994040
Similarity between embeddings 4 and 7: 1.0000000000
Similarity between embeddings 4 and 8: 0.9999997616
Similarity between embeddings 4 and 9: 0.9999997616…

With 3-large and 3072 dimensions, running 20 trials, none of them were the same.

Number of unique embeddings: 20
Similarity between embeddings 0 and 1: 0.9995572567
Similarity between embeddings 0 and 2: 0.9999965429
Similarity between embeddings 0 and 3: 0.9995989203
Similarity between embeddings 0 and 4: 0.9999960661
Similarity between embeddings 0 and 5: 0.9994193912
Similarity between embeddings 0 and 6: 0.9995987415
Similarity between embeddings 0 and 7: 0.9994224906
Similarity between embeddings 0 and 8: 0.9996060133
Similarity between embeddings 0 and 9: 0.9999982715
Similarity between embeddings 0 and 10: 0.9996063113
Similarity between embeddings 0 and 11: 0.9999966025
Similarity between embeddings 0 and 12: 0.9999939799
Similarity between embeddings 0 and 13: 0.9999993443
Similarity between embeddings 0 and 14: 0.9999982119
Similarity between embeddings 0 and 15: 0.9999961257
Similarity between embeddings 0 and 16: 0.9999954700
Similarity between embeddings 0 and 17: 0.9993475080
Similarity between embeddings 0 and 18: 0.9994521737
Similarity between embeddings 0 and 19: 0.9993476272

So previous complaints were not addressed. This is a symptom now seen in all OpenAI models. The results are close enough for all purposes except determining whether two things are identical or whether something already exists.
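
For reference, here is a rough sketch of the kind of repeatability test shown above (the model name and trial count are assumptions; it requires the openai Python SDK and an OPENAI_API_KEY):

import numpy as np
from openai import OpenAI

client = OpenAI()
text = "the same ~600 tokens of text, repeated for every trial"

# Embed the identical input several times and stack the results.
runs = []
for _ in range(10):
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    runs.append(resp.data[0].embedding)

em_ndarray = np.asarray(runs, dtype=np.float32)
print(np.unique(em_ndarray, axis=0).shape)  # more than one unique row => not deterministic

# Returned vectors are (approximately) unit-normalized, so a dot product is cosine similarity.
for i in range(1, 10):
    print(f"Similarity between embeddings 0 and {i}: {float(em_ndarray[0] @ em_ndarray[i]):.10f}")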

As far as “format”, I use

"encoding_format": "base64"

and then decode the base64 and load the 32-bit values into floats without loss.
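
A minimal sketch of that decode step, assuming the openai Python SDK (when you explicitly request encoding_format="base64", the SDK may hand back the raw base64 string rather than a list of floats):

import base64
import numpy as np
from openai import OpenAI

client = OpenAI()
resp = client.embeddings.create(
    model="text-embedding-3-small",
    input="example text to embed",
    encoding_format="base64",
)

raw = resp.data[0].embedding
if isinstance(raw, str):
    # Decode base64 into little-endian 32-bit floats, exactly as the API produced them.
    vector = np.frombuffer(base64.b64decode(raw), dtype="<f4")
else:
    # Some SDK versions decode for you and return a list of floats already.
    vector = np.asarray(raw, dtype=np.float32)

print(vector.shape, vector.dtype)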


ada-002 and the v3 embedding models have an 8,192-token context length.

You will naturally have to split anything larger than that. What you actually do depends on the application and the destination of the large input.
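
If it helps, here is one hedged way to do that split with tiktoken (cl100k_base is the tokenizer used by ada-002 and the v3 embedding models; the chunk size here is just the context limit):

import tiktoken

def chunk_by_tokens(text: str, max_tokens: int = 8192) -> list[str]:
    # Tokenize once, then slice the token list into context-sized windows.
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    return [enc.decode(tokens[i : i + max_tokens]) for i in range(0, len(tokens), max_tokens)]

# A 13,778-token document would come back as two chunks (8,192 + 5,586 tokens).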

The use of encoding_format should have zero effect on the actual performance of the embedding model. All usual considerations still apply.

https://platform.openai.com/docs/api-reference/embeddings/create#embeddings-create-encoding_format

You could keep your full 13,778-token chunk intact, but embed only pieces of it, and then return the full 13,778-token chunk that correlates to the smaller embedded piece.

This way your chunk is coherent and you can feed it into a large context model.

As stated, the AI may not fully digest the larger chunk, because the attention is diluted over this larger span of tokens, and some details may get glossed over. But at least you have a coherent chunk in your retrieval, and hopefully, over time, the attention mechanisms in LLMs will improve.
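
A minimal sketch of this “embed small pieces, return the full chunk” idea (the helper names, model, and example texts are illustrative, not from this thread):

import numpy as np
from openai import OpenAI

client = OpenAI()

def embed_text(text: str) -> np.ndarray:
    resp = client.embeddings.create(model="text-embedding-3-small", input=text)
    return np.asarray(resp.data[0].embedding, dtype=np.float32)

parent_text = "...the full 13,778-token document goes here..."   # what gets returned
pieces = ["small piece 1", "small piece 2", "small piece 3"]      # what gets embedded

# One (vector, parent) pair per small piece; every piece points back to the same parent.
index = [(embed_text(p), parent_text) for p in pieces]

def retrieve(query: str) -> str:
    q = embed_text(query)
    # Embeddings are roughly unit-normalized, so the dot product acts as cosine similarity.
    scores = [float(q @ vec) for vec, _ in index]
    return index[int(np.argmax(scores))][1]   # return the full, coherent parent chunk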

As for determinism, I’m not sure why you need this as long as the cosine similarities are within roundoff error. Making an embedding engine fully deterministic would cause a massive latency hit, because all the arithmetic ops in the computer would have to be done in a very specific synchronized sequence.

The reasoning here is that it is a little-known fact that the distributive law is not reliable with floating-point numbers inside the computer, so in general:

A(B + C) \neq AB + AC

So this is relaxed to:

A(B + C) \approx AB + AC

Therefore it’s not deterministic.
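
A tiny illustration of how floating-point results depend on the order of operations (shown here with associativity and accumulation order rather than the distributive law specifically; the values are arbitrary):

import numpy as np

# Associativity fails exactly in IEEE 754 doubles:
print((0.1 + 0.2) + 0.3 == 0.1 + (0.2 + 0.3))   # False
print((0.1 + 0.2) + 0.3, 0.1 + (0.2 + 0.3))     # 0.6000000000000001 0.6

# A dot product accumulated in two different orders usually disagrees slightly:
rng = np.random.default_rng(0)
x = rng.standard_normal(10_000).astype(np.float32)
y = rng.standard_normal(10_000).astype(np.float32)

forward = np.float32(0.0)
for a, b in zip(x, y):
    forward += a * b

backward = np.float32(0.0)
for a, b in zip(x[::-1], y[::-1]):
    backward += a * b

print(forward, backward)   # typically differ in the last digits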

This is why there will never be determinism, nor would I ever want it, simply because of the benefits of increased speed and lower latency.

PS. These models may also be running at lower than 64-bit precision; they might be 16 or even 8 bits. This lower number of bits, in conjunction with parallelized, non-synchronous arithmetic, would also increase the observed non-determinism. Devoting fewer bits to the weights also improves latencies and reduces costs, but shows up as more non-determinism. So as long as this randomness isn’t drastically throwing off your rankings, it’s a speed and cost benefit to you.

Hi, just out of curiosity, what is your goal if you’re embedding 13k token texts?..

Could you say that the similarities between those embeddings (I’m assuming you used the same text) are close enough? If I wanted to compare large texts, how would I test this in a way where I could compare the embeddings?

Consider that the differences between various calls amount to 0.1% or 0.001% in the similarity scores. A very small amount.

It is enough that, if you were looking for the top-5 chunks of documentation that match a question, some may switch positions, or the #5 result may occasionally change to something different if you were to perform the same search over and over. In practical use, though, the quality of the AI in actually matching language and meaning is far less precise than the variance in scores.
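
If you want to compare two embeddings yourself, cosine similarity is the usual test; here is a minimal sketch (the vector names are placeholders):

import numpy as np

def cosine_similarity(a, b) -> float:
    a = np.asarray(a, dtype=np.float32)
    b = np.asarray(b, dtype=np.float32)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# e.g. two vectors returned by separate calls on the same (or different) text:
# print(f"Similarity: {cosine_similarity(embedding_1, embedding_2):.10f}")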

That made sense and was very helpful!

In theory, would inputting different texts produce different embeddings? And when you say the model is not deterministic, does that mean it only errs in the last digit or so of each embedding coordinate?

In practice, different text produces different embeddings. One token changing from " a" to " the" is a difference.