Different embeddings for exact same text

Recently I noticed that “text-embedding-ada-002” model gives different embedding vectors for the exact same input every time I hit the endpoint.

I was planning to use this model in some similarity and classification use cases, which would allow me to compare any new incoming data against clusters or examples from data captured weeks or months ago.

Is “text-embedding-ada-002” reliable enough to be used for such use cases? Is there a way I can make the embeddings more deterministic?

Hi,

The differences are all in the noise floor, they make no appreciable difference to the semantic meaning that gets encoded.
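To see why noise-floor differences don't matter in practice, here is a quick sanity check in plain Python (the 1e-7 noise level is an assumption for illustration, not a measured figure from the API): perturbing every component of a 1536-dimensional vector at that scale leaves the cosine similarity indistinguishable from 1.

```python
import math
import random

def cosine_similarity(a, b):
    # Standard cosine similarity: dot product over the product of norms.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

random.seed(0)
# A stand-in 1536-dimensional vector (ada-002's output dimensionality).
v = [random.gauss(0, 1) for _ in range(1536)]
# Simulate per-call jitter: perturb each component at roughly the 1e-7
# level (an illustrative assumption about the noise floor).
noisy = [x + random.gauss(0, 1e-7) for x in v]

print(cosine_similarity(v, noisy) > 0.999999)
```

Any similarity or classification decision that flips on a difference that small was already on a knife's edge.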

There is an undocumented API parameter you can try that should make the embeddings exactly the same; more over here:

However, oddly enough, recently we had a “bad embedding day” on Sept. 25. If your data was embedded on that day, it is recommended you either re-embed or discard that data.


Ahh, interesting, must have missed that, cheers!


Thank you for sharing this. That other post also has users reporting different embeddings even after using “encoder_format = float”. So I guess there is no solution for this at this time?

There is a solution to the problem, I believe it’s mentioned in the thread.

But how it works is that you have a database of previously embedded results. The hash key into the database is the hash of the text you embedded.

So:

HashXYZ123 = hash(Text Blah)

So the database has in it rows like:

HashXYZ123 | EmbeddingVector(Text Blah)

So a new thing comes in, you hash it, you look to see if it’s in your database. If it is, you return the embedding vector from the database. If it is not in the database, you create a new embedding vector, and put it in the database with the new hash.

This way, you can call the embedding operation over and over on the same input and get exactly the same result. Problem solved.
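The scheme above can be sketched in a few lines of Python. Here `embed_text` is a hypothetical stand-in for your real embedding call (e.g. a request to "text-embedding-ada-002"), and a plain dict stands in for the persistent database:

```python
import hashlib

def embed_text(text):
    # Hypothetical placeholder for the real API call that returns an
    # embedding vector for `text`. Substitute your actual client here.
    return [float(len(text))]

# In production this would be a persistent store (SQLite, Redis, ...).
cache = {}

def get_embedding(text):
    # Key on a stable hash of the exact input text.
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if key not in cache:
        # Cache miss: call the embedding API once and store the result.
        cache[key] = embed_text(text)
    return cache[key]

# Repeated calls with the same text return the identical stored vector.
v1 = get_embedding("Text Blah")
v2 = get_embedding("Text Blah")
print(v1 is v2)
```

Because the lookup key is a hash of the exact text, any byte-for-byte identical input maps to the same stored vector, which is what makes the results deterministic.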


Thanks, this makes sense. I hope, though, that for very similar (but not exactly the same) input text, the similarity between embeddings will still be greater than 0.99.