Does anyone know if the ada embeddings are de-biased? If not, do we have any techniques to de-bias them before use?
Thanks
There is a large common bias in ada-002 embeddings, and the vectors are highly correlated with one another.
You could increase your thresholds, or if you have a batch of embeddings, you could remove the bias and use PCA to spread them out, like I did here:
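To make the idea concrete, here is a minimal sketch of that de-bias-and-spread step, assuming numpy and scikit-learn. It uses synthetic random vectors with a shared offset as a stand-in for real ada-002 embeddings (the offset mimics the common bias direction); the batch size and component count are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for a batch of ada-002 embeddings: random vectors
# with a large shared offset, mimicking the common bias direction.
rng = np.random.default_rng(0)
raw = rng.normal(size=(500, 1536)) + 5.0
raw /= np.linalg.norm(raw, axis=1, keepdims=True)

# 1. Remove the bias: subtract the mean embedding from every vector.
mean_vec = raw.mean(axis=0)
centered = raw - mean_vec

# 2. Spread the vectors out with PCA (whitening equalizes variance
#    across components; 256 dims is an illustrative choice).
pca = PCA(n_components=256, whiten=True)
spread = pca.fit_transform(centered)
spread /= np.linalg.norm(spread, axis=1, keepdims=True)

# Average pairwise cosine similarity drops sharply after processing.
def mean_cos(x):
    sims = x @ x.T
    n = len(x)
    return (sims.sum() - n) / (n * (n - 1))

print(mean_cos(raw), mean_cos(spread))
```

With the shared offset in place, the raw vectors all point in roughly the same direction (mean pairwise cosine near 1); after de-biasing and whitening they are spread out close to isotropically (mean pairwise cosine near 0).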
Thanks Curt, how did you get the embeddings file? Are the ada embeddings available for download?
You have to run each chunk of your text through the API to get the embedding vector (one at a time, or in batches). You then store this yourself, in a database or a flat file.
Once you have enough, you fit this set of data, save off your bias and fitting weights, and then use these for future embeddings (run code against each new embedding vector). So your database has “raw embedding” and “processed embedding”. Over time, you will have to re-fit and unbias the vectors, as you accumulate more and more raw embedding vectors.
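A sketch of that fit-once, apply-later workflow, assuming numpy and scikit-learn (the file name `embedding_fit.npz`, the batch size, and the 128-component choice are all illustrative, and random vectors stand in for accumulated raw embeddings):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Stand-in for the raw embedding vectors accumulated in your database.
raw_batch = rng.normal(size=(300, 1536)) + 2.0

# Fit once on the accumulated batch.
bias = raw_batch.mean(axis=0)
pca = PCA(n_components=128).fit(raw_batch - bias)

# Save off the bias and fitting weights for future embeddings.
np.savez("embedding_fit.npz", bias=bias, components=pca.components_)

# Later: process each new raw embedding with the saved fit, storing
# both "raw embedding" and "processed embedding".
fit = np.load("embedding_fit.npz")
new_raw = rng.normal(size=1536) + 2.0
processed = fit["components"] @ (new_raw - fit["bias"])
processed /= np.linalg.norm(processed)
```

Re-fitting later just means re-running the fit step on the larger accumulated batch and re-processing the stored raw vectors with the new weights.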
I feel that if OpenAI did this to their embedding model, it would be easier for users (who may not have deep knowledge in this space) to use it, and that would help adoption, since bias is now a key concern. I am not sure if I am thinking in the right direction here.
I believe OpenAI thinks that reducing dimensions using PCA could hurt performance. But from my testing it seemed equivalent, and the geometry was better (more isotropic).
Not sure how they were dropping dimensions, but I coded up the paper in the link here.
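The linked paper isn't reproduced here, but the basic dimension-dropping step can be sketched as plain PCA truncation. This toy check (synthetic low-rank vectors plus noise, not real ada-002 output; 128 components is an arbitrary choice) shows nearest-neighbour order largely surviving the reduction, which is the sense in which retrieval performance stays roughly equivalent:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
# Low-rank signal plus noise, so the vectors have real neighbour structure.
signal = rng.normal(size=(200, 20)) @ rng.normal(size=(20, 1536))
emb = signal + 0.1 * rng.normal(size=(200, 1536))
emb -= emb.mean(axis=0)                                # de-bias first
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

# Drop dimensions: keep only the top 128 principal components.
reduced = PCA(n_components=128).fit_transform(emb)
reduced /= np.linalg.norm(reduced, axis=1, keepdims=True)

# Compare the 10 nearest neighbours of one query before and after.
query_full = emb[0] @ emb.T
query_red = reduced[0] @ reduced.T
top_full = set(np.argsort(query_full)[-11:-1])  # excludes the query itself
top_red = set(np.argsort(query_red)[-11:-1])
overlap = len(top_full & top_red) / 10
print(overlap)
```

When the data has genuine low-dimensional structure, the truncated space keeps almost all of it, so the neighbour overlap is close to 1.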