Different embeddings for exact same text

Recently I noticed that “text-embedding-ada-002” model gives different embedding vectors for the exact same input every time I hit the endpoint.

I was planning to use this model in some similarity and classification use cases, which would allow me to compare any new incoming data against clusters or examples from data captured weeks or months ago.

Is “text-embedding-ada-002” reliable enough to be used for such use cases? Is there a way I can make the embeddings more deterministic?

Hi,

The differences are all in the noise floor, they make no appreciable difference to the semantic meaning that gets encoded.
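To see why noise-floor differences don't matter in practice, here is a quick sanity check in plain Python (the 1e-7 noise level is an assumption for illustration, not a measured figure from the API): perturbing every component of a 1536-dimensional vector at that scale leaves the cosine similarity indistinguishable from 1.

```python
import math
import random

def cosine_similarity(a, b):
    # Standard cosine similarity: dot product over the product of norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

random.seed(0)
# A stand-in 1536-dimensional vector (ada-002's output dimensionality).
v = [random.gauss(0, 1) for _ in range(1536)]
# Simulate per-call jitter: perturb each component at roughly the 1e-7
# level (an illustrative assumption about the noise floor).
noisy = [x + random.gauss(0, 1e-7) for x in v]

print(cosine_similarity(v, noisy) > 0.999999)
```

Any similarity or classification decision that flips on a difference that small was already on a knife's edge.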

There is an undocumented API parameter you can try that should make the embeddings exactly the same; more over here:

However, oddly enough, recently we had a “bad embedding day” on Sept. 25. If your data was embedded on that day, it is recommended you either re-embed or discard that data.


Ahh, interesting, must have missed that, cheers!


Thank you for sharing this. That other post also has users reporting different embeddings even after using “encoder_format = float”. So I guess there is no solution for this at this time?

There is a solution to the problem, I believe it’s mentioned in the thread.

But how it works is that you have a database of previously embedded results. The hash key into the database is the hash of the text you embedded.

So:

HashXYZ123 = hash(Text Blah)

So the database has in it rows like:

HashXYZ123 | EmbeddingVector(Text Blah)

So a new thing comes in, you hash it, you look to see if it’s in your database. If it is, you return the embedding vector from the database. If it is not in the database, you create a new embedding vector, and put it in the database with the new hash.

This way, you can call the embedding operation over and over on the same input and get exactly the same result. Problem solved.
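The scheme above can be sketched in a few lines of Python. Here `embed_text` is a hypothetical stand-in for your real embedding call (e.g. a request to "text-embedding-ada-002"), and a plain dict stands in for the persistent database:

```python
import hashlib

def embed_text(text):
    # Hypothetical placeholder for the real API call that returns an
    # embedding vector for `text`. Substitute your actual client here.
    return [float(len(text))]

# In production this would be a persistent store (SQLite, Redis, ...).
cache = {}

def get_embedding(text):
    # Key on a stable hash of the exact input text.
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key not in cache:
        # Cache miss: call the embedding API once and store the result.
        cache[key] = embed_text(text)
    return cache[key]

# Repeated calls with the same text return the identical stored vector.
v1 = get_embedding("Text Blah")
v2 = get_embedding("Text Blah")
print(v1 is v2)
```

Because the lookup key is a hash of the exact text, any byte-for-byte identical input maps to the same stored vector, which is what makes the results deterministic.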


Thanks, this makes sense. I hope, though, that for very similar (but not exactly the same) input text, the similarity between embeddings will still be greater than 0.99.