OpenAI Embeddings - Multilingual

It depends on the language. For example, in Arabic, I just published a paper showing that Microsoft's E5 performed best on an Arabic dataset. For other languages it might be different; you can check the MTEB leaderboard on HuggingFace.
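If you want to try a multilingual model yourself, here's a rough sketch using the sentence-transformers library and one of the multilingual E5 checkpoints on HuggingFace. The model name and the Arabic example sentences are just illustrative, not taken from the paper:

```python
# Rough sketch (illustrative only): scoring Arabic text with a multilingual
# E5 checkpoint via sentence-transformers. Model and sentences are placeholders.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("intfloat/multilingual-e5-base")

# E5 models expect "query: " / "passage: " prefixes.
query = "query: ما هي عاصمة فرنسا؟"            # "What is the capital of France?"
passages = [
    "passage: باريس هي عاصمة فرنسا.",          # "Paris is the capital of France."
    "passage: القاهرة هي أكبر مدينة في مصر.",  # "Cairo is the largest city in Egypt."
]

q_emb = model.encode(query, normalize_embeddings=True)
p_embs = model.encode(passages, normalize_embeddings=True)

# With normalized vectors, dot product equals cosine similarity.
scores = util.dot_score(q_emb, p_embs)
print(scores)  # the relevant passage should score noticeably higher
```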


Yes, I think when we learn, we are creating new relations in our brains … linking speech to words, words to images, images to memories, memories to thoughts, etc. All of this linking is occurring.

This linking in our brain is populating the "last hidden layer", which represents meaning and semantics. But it's much more than that too, I think, since we are creating these internal knowledge graphs and associations.

Maybe all these things are considered semantics too? So it’s almost like once you get to the point of semantics, you are also at the point of understanding.

In the LLM weights, the semantics are forced in via the training data, and after training the model has a kind of perceived understanding.

But LLMs aren't understanding things at the level of humans (at least not yet), as they don't have any multimodal cross-linking going on, just statistical pattern matching over the training tokens. And there's no internal knowledge graph, unless these graph associations are also part of the training.

So this goes back to the whole training on reasoning thing, and not just tokens, which is basically burning in linkages from token vectors to other token vectors, creating mini knowledge graphs, and giving the model more reasoning powers.

But these higher-level token mappings are still trained; they aren't occurring naturally, as they seem to in the human brain.

I don't know if this can be solved by introducing more logic-based operations into the LLM. For example, (A implies B) is the same as (not B implies not A), so you could do contrastive training where, for each graph edge, you also add the corresponding contrapositive edge going in the opposite direction. Maybe?
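To make the contrapositive idea concrete, here's a toy sketch (purely illustrative, not an actual training pipeline) of augmenting implication edges with their contrapositives before building contrastive pairs:

```python
# Toy sketch: augment implication edges (A -> B) with their contrapositives
# (not B -> not A). Purely illustrative; a real contrastive setup would also
# need a model, negative pairs, and a loss function.
def contrapositive(edge):
    a, b = edge
    return (f"not ({b})", f"not ({a})")

edges = [
    ("it is raining", "the ground is wet"),
    ("x is a square", "x is a rectangle"),
]

augmented = edges + [contrapositive(e) for e in edges]

for premise, conclusion in augmented:
    print(f"{premise}  ==>  {conclusion}")
```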


Totally. I feel like a different person, and even think differently, in a different language. I think that's one of the difficulties of learning a language through applications like Duolingo: they teach by trying to connect our native language to a foreign one, so instead of speaking naturally we are always trying to translate.

It's for this reason that it's almost necessary to be around natives, and speak with natives, to truly grasp a new language. It needs to be disassociated from your native language and settle into its own unique space.

It could be why it’s much easier for a child to learn two languages than an adult.

Learning a language by association will always be inefficient. It's a great start, don't get me wrong. There's a certain point of maturity where the bird has to leave the nest and form its own, though.

I'm away from the computer, but I'd like to see how the embedding model manages.

Also, when we give misspellings to any model, I think it's worth looking at how the text is tokenized.
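For example, a quick sketch with the tiktoken library (the sentences here are just stand-ins) shows how differently a misspelled string gets tokenized:

```python
# Quick sketch: compare how a correct vs. misspelled sentence is tokenized.
# Uses OpenAI's tiktoken library; the example sentences are placeholders.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by ada-002 / GPT-4

clean = "The quick brown fox jumps over the lazy dog"
messy = "Teh qiuck borwn fox jmups oevr the lzay dog"

for label, text in [("clean", clean), ("messy", messy)]:
    ids = enc.encode(text)
    print(label, len(ids), [enc.decode([i]) for i in ids])
```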

Interesting article:


The embedding model manages decently with a dot product of:
Dot Product: 0.8994434485396415

But GPT-4 was able to correct the errors, as evidenced here, so maybe use GPT-4 to spell-check before embedding?
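Something along these lines is what I mean; this is only a rough sketch with the OpenAI Python client, and the model names and the spell-check prompt are assumptions, not a tested recipe:

```python
# Rough sketch: spell-check with a chat model before embedding, then compare
# the corrected text against the original with a dot product. Model names and
# prompt wording are assumptions.
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(text):
    resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return np.array(resp.data[0].embedding)

def spell_check(text):
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Fix spelling only. Return the corrected text."},
            {"role": "user", "content": text},
        ],
    )
    return resp.choices[0].message.content

messy = "Teh qiuck borwn fox jmups oevr the lzay dog"
fixed = spell_check(messy)

# ada-002 embeddings are essentially unit length, so dot product == cosine similarity
print("Dot Product:", np.dot(embed(messy), embed(fixed)))
```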


That's pretty impressive, ngl. I was messing around with misspellings, and they end up closer to the original than synonyms do.

The funny part is the text you used is part of an internet meme, but I think it has merit (it’s so easy to read).

I actually tried feeding it as-is to turbo-instruct, and it just decided to continue the trend LOL

It's like two babies talking to each other and somehow having a coherent discussion.

It's crazy looking at the tokens as well. Even though only ~28% of the tokens match between the scrambled and unscrambled versions, it still manages to score almost ~90%. The scrambled version is also roughly double the number of tokens.
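The overlap numbers are easy to reproduce with something like this (sketch only; the two strings are placeholders standing in for the meme text):

```python
# Sketch: measure how many token ids the scrambled and unscrambled versions
# share, and compare their token counts. The two strings are placeholders.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

unscrambled = "According to a researcher at Cambridge University ..."
scrambled = "Aoccdrnig to a rscheearch at Cmabrigde Uinervtisy ..."

a, b = enc.encode(unscrambled), enc.encode(scrambled)
shared = set(a) & set(b)

print("tokens (unscrambled):", len(a))
print("tokens (scrambled):  ", len(b))
print("shared token ids:    ", len(shared), f"({len(shared) / len(set(a)):.0%} of unscrambled)")
```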

I asked ChatGPT-4 to "wordspin this (the unscrambled text) while perfectly preserving the context and semantics", embedded the result, and got 93%.

Hypothesis: syntactically incorrect tokens are commonly encountered in misspellings and can be "recovered" because of the common contexts they are found in. Beautifully relatable that they cost double the processing power. The keyword is syntactically, I think, because these tokens aren't found anywhere else besides misspellings.


The most disjointed embeddings are still around 0.60. That's a "D" grade at worst…

compare some stuff:

"My pet cat loves me",
"Mi gato mascota me quiere",  # pet cat in Spanish
"Republica celebration murder",
"embeddings cosine dot product"

0:'My pet cat loves me' <==> 1:'Mi gato mascota me qui':
cos: 0.8930911 dot: 0.8930912
0:'My pet cat loves me' <==> 2:'Republica celebration ':
cos: 0.7155187 dot: 0.7155187
0:'My pet cat loves me' <==> 3:'embeddings cosine dot ':
cos: 0.6884603 dot: 0.6884603
1:'Mi gato mascota me qui' <==> 2:'Republica celebration ':
cos: 0.7176356 dot: 0.7176356
1:'Mi gato mascota me qui' <==> 3:'embeddings cosine dot ':
cos: 0.6615954 dot: 0.6615955
2:'Republica celebration ' <==> 3:'embeddings cosine dot ':
cos: 0.6643258 dot: 0.6643258
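For anyone who wants to reproduce this kind of comparison, a sketch along these lines works (the model name and helper code are my guess at the setup, not the exact script used):

```python
# Sketch: pairwise cosine and dot-product comparison of a few strings, roughly
# reproducing the output above. Model and structure are assumptions.
from itertools import combinations
import numpy as np
from openai import OpenAI

client = OpenAI()

texts = [
    "My pet cat loves me",
    "Mi gato mascota me quiere",  # "my pet cat loves me" in Spanish
    "Republica celebration murder",
    "embeddings cosine dot product",
]

resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
vecs = [np.array(d.embedding) for d in resp.data]

for i, j in combinations(range(len(texts)), 2):
    a, b = vecs[i], vecs[j]
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    dot = np.dot(a, b)
    print(f"{i}:'{texts[i][:22]}' <==> {j}:'{texts[j][:22]}':")
    print(f"cos: {cos:.7f} dot: {dot:.7f}")
```

Because ada-002 embeddings come back essentially unit-normalized, cosine and dot product agree to about seven decimals, which is exactly what the output above shows.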

whaaat are you even talking abouuut

This is not true, even for the same language (except probably English and some European languages). Usually you create embedding vectors using the ada model, which is cheap and fast, but it is weak at translation and foreign-language understanding. This is from my experience.