I’m thrilled to share some exciting news with you all. My first paper on AI has been published on arXiv today!
Our research delves into the potential gender biases present in popular text embedding models, a topic of growing importance for developers and businesses utilizing AI technologies. We conducted an in-depth analysis to understand how these models associate professions with gendered terms, revealing significant biases that vary across different models.
I’m proud to contribute, however modestly, to this important area of AI research and to provide insights that can help guide the ethical development and use of AI technologies.
You’re very welcome to read the full paper on arXiv and to discuss any findings, suggestions, or feedback with me: https://arxiv.org/abs/2406.12138
Congrats! I remember that in my last job in HR tech, when we were looking at analyzing skills on resumes, the internal AI team warned us that word2vec and other embedding models had a lot of built-in gender and race bias.
Thanks a lot. Yep, that’s quite an old problem, but the point is that it’s still unsolved and is present even in state-of-the-art text embedding models.
Thanks a lot! We’re starting to work on a paper that would explore the positive and negative impact of applying TF-IDF for text embedding. It can cut embedding and storage costs, but we don’t know how it translates into retrieval accuracy and latency. If we manage to find a good balance of low-relevance words to filter out before vector embedding, we expect it could reduce embedding and storage costs and improve latency, maybe without any impact on accuracy or, who knows, maybe even while improving the accuracy of information retrieval.
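To make the idea concrete, here is a minimal sketch of filtering low-relevance words before embedding. It assumes the IDF part of TF-IDF is used as the relevance signal and that words below a cutoff are dropped; the function names, the toy corpus, and the `min_idf` value are all hypothetical, and the actual embedding call is left out.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def idf_table(corpus):
    """Inverse document frequency per word: log(N / document_frequency)."""
    n_docs = len(corpus)
    df = Counter()
    for doc in corpus:
        df.update(set(tokenize(doc)))  # count each word once per document
    return {word: math.log(n_docs / count) for word, count in df.items()}


def drop_low_idf_words(text, idf, min_idf):
    """Keep words whose IDF is at least min_idf (a hypothetical cutoff).

    Words never seen in the corpus are treated as rare and kept.
    """
    return " ".join(w for w in tokenize(text) if idf.get(w, math.inf) >= min_idf)


# Toy corpus, for illustration only.
corpus = [
    "the nurse spoke to the patient",
    "the engineer fixed the bridge",
    "the teacher spoke to the class",
]
idf = idf_table(corpus)
filtered = [drop_low_idf_words(doc, idf, min_idf=0.5) for doc in corpus]
# `filtered` holds the shortened texts that would be sent to the embedding model.
```

The cutoff is the knob to tune: a higher `min_idf` sends fewer tokens to the embedding model, and the open question is where retrieval accuracy starts to suffer.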
Would appreciate any thoughts / feedback on that matter.
My approach is to use TF-IDF in tandem with embeddings. And also use multiple embedding engines in parallel, all with different weighting factors.
This should reduce any specific bias, since you are averaging, unless the bias is universal across all the retrieval lanes.
The TF-IDF can be used as a worst-case backup, since it doesn’t require complicated inference or APIs. And yes, with TF-IDF you can set your own custom stopword cutoff, based on rarity within the corpus.
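A rough sketch of what combining lanes with weighting factors could look like, assuming each lane returns a score per document and the lanes are merged by a weighted average of min-max normalized scores. The lane names, scores, and weights are made up for illustration; other fusion schemes would fit the same description.

```python
def min_max(scores):
    """Rescale a {doc_id: score} dict to the 0..1 range."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def fuse(lanes, weights):
    """Weighted average of normalized scores across retrieval lanes.

    A document missing from a lane simply gets no contribution from it.
    Returns (doc_id, fused_score) pairs, best first.
    """
    total = sum(weights[name] for name in lanes)
    fused = {}
    for name, scores in lanes.items():
        for doc, s in min_max(scores).items():
            fused[doc] = fused.get(doc, 0.0) + weights[name] * s / total
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)


# Hypothetical scores from one TF-IDF lane and two embedding lanes.
lanes = {
    "tfidf": {"a": 2.0, "b": 1.0, "c": 0.0},
    "embed_1": {"a": 0.70, "b": 0.90, "c": 0.50},
    "embed_2": {"a": 0.60, "b": 0.80, "c": 0.40},
}
weights = {"tfidf": 1.0, "embed_1": 2.0, "embed_2": 2.0}
ranking = fuse(lanes, weights)
```

Normalizing per lane matters here because TF-IDF scores and cosine similarities live on different scales; without it, one lane would dominate regardless of its weight.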
Hi @curt.kennedy ! Thanks a lot for sharing! Could you please elaborate on using multiple embedding models in parallel? That sounds very interesting. Would it be something like using Cohere or OpenAI embedding in combination with BERT, for example, and then using both vector embeddings?
Thanks for sharing, @curt.kennedy ! And what do you think about embedding the BoW (Bag of Words) and seeing whether this embedding leg (of the keywords only) would be more efficient and maybe even more accurate?
I haven’t tried embedding the Bag of Words (BoW). But the keyword approach might be less biased than the embedding vector.
So you could cascade, and take the keywords first (less biased) followed by semantics second (embeddings).
So you are filtering for low bias (with keywords), and then using semantics to sort out the remaining low-bias retrievals.
The only problem is with shorter passages that don’t have many keywords, where your keyword overlap would be low. So this would work best for larger passages, not short sentences or paragraphs that are sparse on keywords.
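The cascade could be sketched roughly like this, assuming the keyword stage is a simple overlap count against the query and the semantic stage is a cosine-similarity sort over precomputed vectors. The stopword list, the `min_overlap` threshold, and the two-dimensional toy vectors are placeholders, not output from a real embedding model.

```python
import math
import re


def keywords(text, stopwords):
    """Set of lowercase tokens that are not stopwords."""
    return {w for w in re.findall(r"[a-z0-9]+", text.lower()) if w not in stopwords}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cascade(query, query_vec, passages, stopwords, min_overlap=1):
    """Stage 1: keep passages sharing enough keywords with the query.
    Stage 2: sort the survivors by embedding similarity, best first.

    `passages` is a list of (text, vector) pairs.
    """
    q_words = keywords(query, stopwords)
    survivors = [
        (text, vec)
        for text, vec in passages
        if len(q_words & keywords(text, stopwords)) >= min_overlap
    ]
    survivors.sort(key=lambda p: cosine(query_vec, p[1]), reverse=True)
    return [text for text, _ in survivors]


stopwords = {"the", "a", "for", "of", "in", "with"}
passages = [
    ("Engineer resume listing skills", [0.6, 0.8]),
    ("Nurse resume with clinical skills", [0.9, 0.1]),
    ("Weather report", [1.0, 0.0]),
]
results = cascade("resume skills for nurse", [1.0, 0.0], passages, stopwords, min_overlap=2)
```

Note that "Weather report" is dropped at the keyword stage even though its toy vector matches the query perfectly. That is the filtering effect, and it is also why short passages with few keywords can be lost before the semantic stage ever sees them.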