Does anyone know if the ada embeddings are de-biased? If not, do we have any techniques to de-bias them before use?
Thanks
There is a large common bias in ada-002 embeddings, and the vectors are highly correlated with one another.
You could increase your thresholds, or if you have a batch of embeddings, you could remove the bias and use PCA to spread them out, like I did here:
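To make the idea concrete, here is a minimal sketch of that de-bias-and-spread step, assuming numpy and scikit-learn. It uses synthetic random vectors with a shared offset as a stand-in for real ada-002 embeddings (the offset mimics the common bias direction); the batch size and component count are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for a batch of ada-002 embeddings: random vectors
# with a large shared offset, mimicking the common bias direction.
rng = np.random.default_rng(0)
raw = rng.normal(size=(500, 1536)) + 5.0
raw /= np.linalg.norm(raw, axis=1, keepdims=True)

# 1. Remove the bias: subtract the mean embedding from every vector.
mean_vec = raw.mean(axis=0)
centered = raw - mean_vec

# 2. Spread the vectors out with PCA (whitening equalizes variance
#    across components; 256 dims is an illustrative choice).
pca = PCA(n_components=256, whiten=True)
spread = pca.fit_transform(centered)
spread /= np.linalg.norm(spread, axis=1, keepdims=True)

# Average pairwise cosine similarity drops sharply after processing.
def mean_cos(x):
    sims = x @ x.T
    n = len(x)
    return (sims.sum() - n) / (n * (n - 1))

print(mean_cos(raw), mean_cos(spread))
```

With the shared offset in place, the raw vectors all point in roughly the same direction (mean pairwise cosine near 1); after de-biasing and whitening they are spread out close to isotropically (mean pairwise cosine near 0).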
Thanks Curt, how did you get the embeddings file? Are the ada embeddings available for download?
You have to run each chunk of your text through the API to get the embedding vector (one at a time, or in batches). You then store this yourself, in a database or a flat file.
Once you have enough, you fit this set of data, save off your bias and fitting weights, and then use these for future embeddings (run code against each new embedding vector). So your database has “raw embedding” and “processed embedding”. Over time, you will have to re-fit and unbias the vectors, as you accumulate more and more raw embedding vectors.
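A sketch of that fit-once, apply-later workflow, assuming numpy and scikit-learn (the file name `embedding_fit.npz`, the batch size, and the 128-component choice are all illustrative, and random vectors stand in for accumulated raw embeddings):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Stand-in for the raw embedding vectors accumulated in your database.
raw_batch = rng.normal(size=(300, 1536)) + 2.0

# Fit once on the accumulated batch.
bias = raw_batch.mean(axis=0)
pca = PCA(n_components=128).fit(raw_batch - bias)

# Save off the bias and fitting weights for future embeddings.
np.savez("embedding_fit.npz", bias=bias, components=pca.components_)

# Later: process each new raw embedding with the saved fit, storing
# both "raw embedding" and "processed embedding".
fit = np.load("embedding_fit.npz")
new_raw = rng.normal(size=1536) + 2.0
processed = fit["components"] @ (new_raw - fit["bias"])
processed /= np.linalg.norm(processed)
```

Re-fitting later just means re-running the fit step on the larger accumulated batch and re-processing the stored raw vectors with the new weights.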
I feel that if OpenAI did this to their embedding model, it would be easier for users (who may not have deep knowledge in this space) to use it, and that would help adoption, since bias is now a key concern. I am not sure if I am thinking in the right direction here.
I believe OpenAI thinks that reducing dimensions using PCA could hurt performance. But from my testing it seemed equivalent, and the geometry was better (more isotropic).
Not sure how they were dropping dimensions, but I coded up the paper in the link here.
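The linked paper isn't reproduced here, but the basic dimension-dropping step can be sketched as plain PCA truncation. This toy check (synthetic low-rank vectors plus noise, not real ada-002 output; 128 components is an arbitrary choice) shows nearest-neighbour order largely surviving the reduction, which is the sense in which retrieval performance stays roughly equivalent:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
# Low-rank signal plus noise, so the vectors have real neighbour structure.
signal = rng.normal(size=(200, 20)) @ rng.normal(size=(20, 1536))
emb = signal + 0.1 * rng.normal(size=(200, 1536))
emb -= emb.mean(axis=0)                                # de-bias first
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

# Drop dimensions: keep only the top 128 principal components.
reduced = PCA(n_components=128).fit_transform(emb)
reduced /= np.linalg.norm(reduced, axis=1, keepdims=True)

# Compare the 10 nearest neighbours of one query before and after.
query_full = emb[0] @ emb.T
query_red = reduced[0] @ reduced.T
top_full = set(np.argsort(query_full)[-11:-1])  # excludes the query itself
top_red = set(np.argsort(query_red)[-11:-1])
overlap = len(top_full & top_red) / 10
print(overlap)
```

When the data has genuine low-dimensional structure, the truncated space keeps almost all of it, so the neighbour overlap is close to 1.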