In early April, I ran around 100 (Dutch) sentences through text-embedding-ada-002-v2, applied a clustering method, and got decent results. We didn’t get around to working on it further until recently, when I ran the same sentences again, this time through Azure, which claims to use the same model. The results were rather different and, frankly, considerably worse. Then I ran the same sentences through OpenAI again and found that those results were also much worse than in April. In the first sentence I picked, the sign had changed in 30 places, and in one place the old and new values of a vector element differed by a factor of 54.
Has there been a change in the model, even though the name is still the same? I read about non-deterministic output, but I don’t think that can explain the differences in clustering results I saw.
I would think OpenAI would be smart enough to know that changing the behavior of an embedding model would be a breaking change for many applications. Key phrase: “I would think”.
I did a bit of analysis on just two different runs, knowing that the model is now non-deterministic and simply won’t return the same vector 4 times out of 5. Embedding the first page of the GPT-4 paper:
Number of sign flips: 4
Sign flip: 7.541279046563432e-05 to -2.7209505788050592e-05
Sign flip: -8.91096715349704e-05 to 4.362581603345461e-05
Sign flip: -3.797286990447901e-05 to 0.000126959930639714
Sign flip: -0.0001502926170360297 to 8.621229426353239e-06
Minimum percentage difference: 0.01%
Average percentage difference: 1.82%
Maximum percentage difference: 148.96%
So before blaming a model change, you have to compare how your previous embeddings stack up against the randomness you get back out of the thing. That noise is enough to disguise such model changes, or to force you to average 10 runs to converge on an embedding value.
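For anyone who wants to run the same kind of comparison on their own vectors, here is a minimal sketch in plain Python. The function name `compare_embeddings` and the exact definition of "percentage difference" are my assumptions, not necessarily what produced the numbers above. I divide by the larger of the two magnitudes, which at least reproduces the 148.96% maximum for the second flipped pair listed. The cosine similarity is an addition of mine, since that is usually what clustering depends on.

```python
import math

def compare_embeddings(a, b):
    """Compare two embedding vectors of equal length, element by element."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")

    # Pairs whose sign differs between the two runs.
    sign_flips = [(x, y) for x, y in zip(a, b) if x * y < 0]

    # Percentage difference per element, relative to the larger magnitude
    # (assumed definition). Pairs where both values are zero are skipped.
    pct = [
        abs(x - y) / max(abs(x), abs(y)) * 100
        for x, y in zip(a, b)
        if max(abs(x), abs(y)) > 0
    ]

    # Cosine similarity between the two full vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))

    return {
        "sign_flips": sign_flips,
        "min_pct": min(pct),
        "avg_pct": sum(pct) / len(pct),
        "max_pct": max(pct),
        "cosine": dot / norm,
    }

# Toy example with made-up two-element vectors; real ada-002 vectors have 1536 elements.
result = compare_embeddings([1.0, -2.0], [1.0, 2.0])
print(len(result["sign_flips"]), result["max_pct"], result["cosine"])
```

Note that the sign flips listed above all sit on elements of magnitude around 1e-4, which is small next to the typical element of a unit-length 1536-dimensional vector, so they may barely move the cosine similarity. Comparing the cosine between two fresh runs with the cosine between an April vector and a fresh one is probably the more telling test.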