Discrepancy in embedding precision

Hmm, ok, something interesting. Yesterday I tested getting embeddings with the openai Python library using the default settings. As suggested in this thread, embedding the same text twice produced slightly different vectors; the cosine similarity between the two was ~0.999. I then set encoding_format="float", which overrides the default of base64, and lo and behold, embedding the same text twice gave identical vectors. So I switched to that in my code.

However, I went back this morning to figure out whether the small error in the default method was coming from OpenAI's servers or from the Python library itself, and when I re-tested with the default settings (which use base64), I got the same vector for the same text. So today it seems to be fixed, even though I used the same text and settings as yesterday. My guess is that either this was actually fixed between yesterday and today, or the discrepancy is semi-random and transient, which would be weird.

Anyway, I'd recommend using float as the encoding_format for now, but we'd need more testing to be sure. Would be great to get someone from OpenAI to look into this. A rough sketch of the test I ran is below.
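
In case anyone wants to reproduce this, here's roughly what I did (a minimal sketch; the model name text-embedding-3-small and the sample text are placeholders, not necessarily what I used):

```python
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

MODEL = "text-embedding-3-small"  # placeholder model, swap in whichever you use
TEXT = "The quick brown fox jumps over the lazy dog."  # placeholder text


def get_embedding(encoding_format=None):
    # encoding_format=None lets the library apply its default (base64,
    # decoded to floats client-side); "float" asks the API for floats directly.
    kwargs = {"model": MODEL, "input": TEXT}
    if encoding_format is not None:
        kwargs["encoding_format"] = encoding_format
    resp = client.embeddings.create(**kwargs)
    return np.array(resp.data[0].embedding)


def cosine_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# Embed the same text twice per encoding format and compare the results.
for fmt in (None, "float"):
    v1 = get_embedding(fmt)
    v2 = get_embedding(fmt)
    print(f"encoding_format={fmt!r}: identical={np.array_equal(v1, v2)}, "
          f"cosine_sim={cosine_sim(v1, v2):.10f}")
```

Yesterday the default run showed identical=False with cosine_sim ~0.999, while the "float" run showed identical=True; this morning both came back identical.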
