My First AI Paper Published on arXiv! (potential gender biases present in popular text embedding models)

vasyl · June 19, 2024, 6:00am

Hello fellow OpenAI Community!

I’m thrilled to share some exciting news with you all. My first paper on AI has been published on arXiv today!

Our research delves into the potential gender biases present in popular text embedding models, a topic of growing importance for developers and businesses utilizing AI technologies. We conducted an in-depth analysis to understand how these models associate professions with gendered terms, revealing significant biases that vary across different models.
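For readers who want a feel for what "associating professions with gendered terms" can mean in practice, here is a minimal sketch. It is not the method from the paper: it assumes a simple probe where a profession's cosine similarity to a male term is compared with its similarity to a female term. The 3-dimensional vectors and the `gender_association` helper are made up for illustration; in a real test the vectors would come from an embedding model.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def gender_association(profession_vec, male_vec, female_vec):
    """Hypothetical probe: > 0 means closer to the male term, < 0 closer to the female term."""
    return cosine(profession_vec, male_vec) - cosine(profession_vec, female_vec)

# Toy vectors invented for this example (NOT real model output).
toy = {
    "man": [1.0, 0.0, 0.2],
    "woman": [0.0, 1.0, 0.2],
    "engineer": [0.8, 0.2, 0.5],
    "nurse": [0.2, 0.8, 0.5],
}

scores = {
    job: gender_association(toy[job], toy["man"], toy["woman"])
    for job in ("engineer", "nurse")
}
for job, score in scores.items():
    print(f"{job}: {score:+.3f}")
```

With these toy vectors "engineer" scores positive and "nurse" negative; an unbiased model would ideally keep both close to zero.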

I’m so proud to contribute (even though I understand how tiny its importance is) to this important area of AI research and provide insights that can help guide the ethical development and use of AI technologies.

You’re very welcome to read the full paper on arXiv and to discuss any findings, suggestions, or feedback with me: https://arxiv.org/abs/2406.12138

zhukov.vladimir · June 19, 2024, 9:16am

Congratulations on the paper! Very good job!

rajatrocks · June 21, 2024, 5:41pm

Congrats! I remember that in my last job in HR tech, when we were looking at analyzing skills on resumes, the internal AI team warned us that word2vec and other embedding models had a lot of built-in gender and race bias.

vasyl · June 23, 2024, 5:15am

Thanks a lot. Yep, that’s quite an old problem, but the point is that it’s still unsolved and is present even among the state-of-the-art text embedding models.

jbayonne · June 24, 2024, 9:01pm

There’s nothing tiny about this contribution to the knowledge base…well done!

It’s essential that this colossal topic be continuously addressed, as AI itself touches nearly every domain of human experience…

vasyl · June 25, 2024, 10:56pm

Thanks a lot, @jbayonne !

PaulBellow · June 25, 2024, 11:03pm

Congrats!

Anything to share on the process with us?

What are you researching next?

vasyl · June 25, 2024, 11:18pm

Thanks a lot! We’re starting work on a paper that explores the positive and negative impact of applying TF-IDF before text embedding. It can reduce embedding and storage costs, but we don’t yet know how it translates into retrieval accuracy and latency. If we manage to find a good threshold for filtering low-relevance words out of the text before embedding, we expect it could cut embedding and storage costs and improve latency, maybe without any impact on accuracy or, who knows, maybe even improving the accuracy of information retrieval.
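To make the idea concrete, here is a rough sketch of TF-IDF filtering applied before embedding. It rests on several assumptions that are not from the planned paper: whitespace tokenization, the plain `tf * log(N / df)` weighting, and a hypothetical `keep_ratio` parameter that controls what share of each document's distinct words survives.

```python
import math
from collections import Counter

def tfidf_filter(docs, keep_ratio=0.5):
    """Keep only the highest-TF-IDF words of each document (original word order preserved)."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each word appears.
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    filtered = []
    for toks in tokenized:
        tf = Counter(toks)
        # Words that occur in every document get idf = log(1) = 0.
        weights = {
            t: (tf[t] / len(toks)) * math.log(n_docs / doc_freq[t]) for t in tf
        }
        n_keep = max(1, math.ceil(len(weights) * keep_ratio))
        keep = set(sorted(weights, key=weights.get, reverse=True)[:n_keep])
        filtered.append(" ".join(t for t in toks if t in keep))
    return filtered

docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "the bird flew over the house",
]
filtered = tfidf_filter(docs)
for before, after in zip(docs, filtered):
    print(f"{before!r} -> {after!r}")
```

The shortened strings would then be sent to the embedding model instead of the full text. Whether that helps or hurts retrieval accuracy is exactly the open question.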

Would appreciate any thoughts / feedback on that matter.

curt.kennedy · June 26, 2024, 3:45pm

My approach is to use TF-IDF in tandem with embeddings. And also use multiple embedding engines in parallel, all with different weighting factors.

This should reduce any specific bias, since you are averaging, unless the bias is universal across all the retrieval lanes.

The TF-IDF can be used as a worst-case backup, since it doesn’t require complicated inference or APIs. And yes, with TF-IDF you can set your own custom stopword cutoff, based on rarity within the corpus.
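One possible way to combine several retrieval lanes with different weighting factors is sketched below. This is only an illustration and not necessarily the scheme from the write-up: it assumes each lane returns a score per document, min-max normalizes the scores within each lane, and takes a weighted average. The lane names, scores, and weights are invented.

```python
def fuse_scores(lane_scores, lane_weights):
    """Weighted average of per-lane, min-max normalized scores for each document."""
    fused = {}
    total_weight = sum(lane_weights.values())
    for lane, scores in lane_scores.items():
        if not scores:
            continue  # a lane that returned nothing contributes nothing
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0  # avoid dividing by zero when all scores are equal
        for doc_id, score in scores.items():
            normalized = (score - lo) / span
            fused[doc_id] = fused.get(doc_id, 0.0) + lane_weights[lane] * normalized
    return {doc_id: value / total_weight for doc_id, value in fused.items()}

# Made-up scores from two embedding lanes and a TF-IDF lane.
lane_scores = {
    "embed_a": {"d1": 0.9, "d2": 0.5, "d3": 0.1},
    "embed_b": {"d1": 0.2, "d2": 0.8, "d3": 0.4},
    "tfidf": {"d1": 3.0, "d2": 1.0, "d3": 0.0},
}
lane_weights = {"embed_a": 0.5, "embed_b": 0.3, "tfidf": 0.2}

fused = fuse_scores(lane_scores, lane_weights)
ranking = sorted(fused, key=fused.get, reverse=True)
print(ranking)
```

Normalizing per lane matters because cosine similarities and TF-IDF scores live on different scales. Without it, the lane with the largest raw numbers would dominate regardless of its weight.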

vasyl · June 28, 2024, 6:40am

Hi @curt.kennedy ! Thanks a lot for sharing! Could you please elaborate on using multiple embedding models in parallel? That sounds very interesting. Would it be something like using Cohere or OpenAI embedding in combination with BERT, for example, and then using both vector embeddings?

curt.kennedy · June 29, 2024, 12:28am

I have a write up over here:

vasyl · July 3, 2024, 9:34pm

Thanks for sharing, @curt.kennedy ! And what do you think about embedding the BoW and seeing whether this embedding leg (of the keywords only) would be more efficient and maybe even more accurate?

curt.kennedy · July 4, 2024, 1:30am

I haven’t tried embedding the Bag of Words (BoW). But the keyword approach might be less biased than the embedding vector.

So you could cascade, and take the keywords first (less biased) followed by semantics second (embeddings).

So you are filtering for low bias (with keywords), and then using semantics to sort out the remaining low bias retrievals.

The only problem is shorter passages without many keywords, where your keyword overlap would be low. So this would work best for larger passages, not short sentences or paragraphs that are sparse on keywords.
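The cascade described above could look roughly like this. It is a sketch under stated assumptions: keyword overlap is measured as the share of query words found in the passage, `min_overlap` and `top_k` are hypothetical parameters, and the 2-dimensional vectors are toy stand-ins for real embeddings.

```python
import math

def keyword_overlap(query, passage):
    """Fraction of distinct query words that also appear in the passage."""
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / len(q) if q else 0.0

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def cascade_retrieve(query, query_vec, passages, passage_vecs, min_overlap=0.5, top_k=3):
    # Stage 1: keep only passages with enough keyword overlap.
    survivors = [
        i for i, passage in enumerate(passages)
        if keyword_overlap(query, passage) >= min_overlap
    ]
    # Stage 2: sort the survivors by embedding similarity to the query.
    survivors.sort(key=lambda i: cosine(query_vec, passage_vecs[i]), reverse=True)
    return [passages[i] for i in survivors[:top_k]]

passages = [
    "embedding models show gender bias",
    "gender bias in hiring tools",
    "cooking pasta at home",
]
# Toy vectors invented for illustration (NOT real embeddings).
passage_vecs = [[0.6, 0.8], [0.9, 0.1], [1.0, 0.0]]
query, query_vec = "gender bias embedding", [1.0, 0.0]

results = cascade_retrieve(query, query_vec, passages, passage_vecs)
print(results)
```

Note that the third passage has the highest toy similarity but never reaches the semantic stage because it shares no keywords with the query. That is the intended filtering effect, and it is also the weakness for short passages mentioned above.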
