I created two embeddings using ADA for two words in Portuguese: cimento (cement) and sorvete (ice cream). The cosine similarity between them was 0.8, which is clearly wrong. Any thoughts?
For single-word embeddings in a language other than English, I think you picked the wrong model. Ada is best for comparing larger texts, and user inputs to texts.
The best reported performance there for semantic word embeddings is GloVe 600. You can see how to use the Gensim model and obtain single-word vectors here:
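A minimal sketch of the idea with toy vectors (the Gensim calls are shown in comments; the model name and word vectors below are illustrative, not measured values):

```python
import numpy as np

# With Gensim you would load pretrained vectors, e.g.:
#   import gensim.downloader as api
#   wv = api.load("glove-wiki-gigaword-300")  # English GloVe; for Portuguese
#   sim = wv.similarity("cement", "ice")      # you'd load Portuguese vectors
#
# Under the hood, the score is just the cosine of the two word vectors:

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors: unrelated words should score near 0, not 0.8.
v_cement = [1.0, 0.1, 0.0]
v_icecream = [0.0, 0.1, 1.0]
print(cosine_similarity(v_cement, v_icecream))  # small value near 0
```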
Just tested for larger texts and the results are still bad.
Using stjiris/bert-large-portuguese-cased-legal-tsdae-gpl-nli-sts-MetaKD-v0, which is a Sentence Transformer for Portuguese, the results are consistent for both single words and sentences.
I would expect the OpenAI embeddings to be far better than anything else.
Even in English, you will get a cosine similarity of 0.8 for non-related things.
I have talked about this problem to death. So there are two solutions.
- Easy: Tighten your bounds for similarity.
- Hard: Post-process the embedding vectors, removing the correlations and biases in the model that cause this.
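A minimal sketch of the "easy" option, assuming ada-002 scores for your data mostly land in roughly the 0.7–1.0 band (the cutoffs below are illustrative assumptions; calibrate them on your own corpus):

```python
def rescale_similarity(sim, lo=0.70, hi=1.00):
    """Map ada-002's compressed cosine range onto [0, 1].
    lo/hi are assumed bounds for the observed score band, not
    measured constants -- tune them on your own data."""
    return max(0.0, min(1.0, (sim - lo) / (hi - lo)))

print(rescale_similarity(0.8))  # the suspicious 0.8 becomes a weak ~0.33
```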
This is not a Portuguese thing! It’s an Ada-002 thing!
Yes, that’s exactly what I got. 0.8 for any kind of comparison.
Thanks for clarifying!
I see a lot of tutorials teaching how to do RAG (with OpenAI), and the basic principle is to search a vector database to filter relevant text associated with the question. This search is done via semantic similarity. How can this work reasonably well given this cosine-similarity behavior?
It works for most people because folks aren't paying attention to the similarity value itself. They are only grabbing the top-K highest-scoring things, so the non-isotropic behavior of the model is swept under the rug.
They aren't looking for de-correlated or orthogonal things, or measuring how uncorrelated the "top K" results really are. Another reason is that, with RAG, they feed the top-K answers into the LLM and let the LLM decide whether each one is related or not.
So because the LLM can also sort out the non-correlated results, it also sweeps the issue under the rug.
There are many theories on why this happens, such as the model overcompensating in its hidden states. It seems to happen in most models, but it is really bad in ada-002.
Here is the paper discussing this:
Also, here is code where I implemented "ABTT" (All-But-The-Top), which essentially de-biases and de-correlates your embeddings.
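For reference, the core of All-But-The-Top (Mu & Viswanath, 2018) fits in a few lines. This is a generic sketch of the technique, not Curt's exact code: center the embeddings, then strip the top-d principal components that carry the shared bias.

```python
import numpy as np

def abtt(embeddings, d=2):
    """All-But-The-Top: subtract the mean embedding, then remove the
    top-d principal components, which carry the common directions that
    inflate cosine similarity between unrelated texts."""
    X = embeddings - embeddings.mean(axis=0)
    # Principal directions via SVD of the centered matrix.
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    top = vt[:d]                     # shape (d, dim)
    return X - X @ top.T @ top       # drop projections onto top-d directions

# Demo: vectors sharing a big common offset look spuriously similar.
rng = np.random.default_rng(0)
raw = rng.normal(size=(100, 64)) + 10.0   # shared bias of +10 everywhere
a, b = raw[0], raw[1]
before = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

proc = abtt(raw, d=2)
a, b = proc[0], proc[1]
after = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(before, after)  # ~0.99 before; near 0 after processing
```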
Thanks a lot Curt.
I will definitely take a look at those papers.