OpenAI Embeddings - Multilingual

It depends on the language. For example, in Arabic, I just published a paper showing that Microsoft's E5 performed best on an Arabic dataset. For other languages it might be different; you can check the MTEB leaderboard on HuggingFace.
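If you want to try a multilingual model yourself, here's a rough sketch using the sentence-transformers library and one of the multilingual E5 checkpoints on HuggingFace. The model name and the Arabic example sentences are just illustrative, not taken from the paper:

```python
# Rough sketch (illustrative only): scoring Arabic text with a multilingual
# E5 checkpoint via sentence-transformers. Model and sentences are placeholders.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("intfloat/multilingual-e5-base")

# E5 models expect "query: " / "passage: " prefixes.
query = "query: ما هي عاصمة فرنسا؟"            # "What is the capital of France?"
passages = [
    "passage: باريس هي عاصمة فرنسا.",          # "Paris is the capital of France."
    "passage: القاهرة هي أكبر مدينة في مصر.",  # "Cairo is the largest city in Egypt."
]

q_emb = model.encode(query, normalize_embeddings=True)
p_embs = model.encode(passages, normalize_embeddings=True)

# With normalized vectors, dot product equals cosine similarity.
scores = util.dot_score(q_emb, p_embs)
print(scores)  # the relevant passage should score noticeably higher
```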


Yes, I think when we learn, we are creating new relations in our brains … linking speech to words, words to images, images to memories, memories to thoughts, etc. All of this linking is occurring.

This linking in our brain is populating the "last hidden layer", which represents meaning and semantics. But it's much more than that too, I think, since we are creating these internal knowledge graphs and associations.

Maybe all these things are considered semantics too? So it’s almost like once you get to the point of semantics, you are also at the point of understanding.

In the LLM weights, the semantics are forced in via the training data, and after training the model has a kind of perceived understanding.

But LLMs aren't understanding things at the level of humans (at least not yet), as they don't have any multimodal cross-linking going on, just statistical pattern matching over the training tokens. And there's no internal knowledge graph, unless these graph associations are also part of the training.

So this goes back to the whole training on reasoning thing, and not just tokens, which is basically burning in linkages from token vectors to other token vectors, creating mini knowledge graphs, and giving the model more reasoning powers.

But these higher-level token mappings are still trained; they aren't occurring naturally, as they seem to in the human brain.

I don't know if this can be solved by introducing more logic-based operations into the LLM. For example, (A implies B) is the same as (not B implies not A), so you could do contrastive training where, for each graph edge, you also add the corresponding contrapositive edge going in the opposite direction. Maybe?
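To make the contrapositive idea concrete, here's a toy sketch (purely illustrative, not an actual training pipeline) of augmenting implication edges with their contrapositives before building contrastive pairs:

```python
# Toy sketch: augment implication edges (A -> B) with their contrapositives
# (not B -> not A). Purely illustrative; a real contrastive setup would also
# need a model, negative pairs, and a loss function.
def contrapositive(edge):
    a, b = edge
    return (f"not ({b})", f"not ({a})")

edges = [
    ("it is raining", "the ground is wet"),
    ("x is a square", "x is a rectangle"),
]

augmented = edges + [contrapositive(e) for e in edges]

for premise, conclusion in augmented:
    print(f"{premise}  ==>  {conclusion}")
```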


Totally. I feel like a different person, and even think differently, in a different language. I think that's one of the difficulties of learning a language through applications like Duolingo: they teach by trying to connect our native language to a foreign one, so instead of speaking naturally we are always trying to translate.

It's for this reason that it's almost necessary to be around natives, and speak with natives, to truly grasp a new language. It needs to be disassociated from your native language and settle into its own unique space.

It could be why it’s much easier for a child to learn two languages than an adult.

Learning a language by association will always be inefficient. It's a great start, don't get me wrong. There's a certain point of maturity where the bird has to leave the nest and form its own, though.

I'm away from the computer, but I'd like to see how the embedding model manages.

Also, when we give misspellings to any model, I think it's worth looking at how the text is tokenized.
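For example, a quick sketch with the tiktoken library (the sentences here are just stand-ins) shows how differently a misspelled string gets tokenized:

```python
# Quick sketch: compare how a correct vs. misspelled sentence is tokenized.
# Uses OpenAI's tiktoken library; the example sentences are placeholders.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by ada-002 / GPT-4

clean = "The quick brown fox jumps over the lazy dog"
messy = "Teh qiuck borwn fox jmups oevr the lzay dog"

for label, text in [("clean", clean), ("messy", messy)]:
    ids = enc.encode(text)
    print(label, len(ids), [enc.decode([i]) for i in ids])
```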

Interesting article:


The embedding model manages decently with a dot product of:
Dot Product: 0.8994434485396415

But GPT-4 was able to correct the errors, as evidenced here, so maybe use GPT-4 to spell-check before embedding?
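Something along these lines is what I mean; this is only a rough sketch with the OpenAI Python client, and the model names and the spell-check prompt are assumptions, not a tested recipe:

```python
# Rough sketch: spell-check with a chat model before embedding, then compare
# the corrected text against the original with a dot product. Model names and
# prompt wording are assumptions.
import numpy as np
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def embed(text):
    resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return np.array(resp.data[0].embedding)

def spell_check(text):
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Fix spelling only. Return the corrected text."},
            {"role": "user", "content": text},
        ],
    )
    return resp.choices[0].message.content

messy = "Teh qiuck borwn fox jmups oevr the lzay dog"
fixed = spell_check(messy)

# ada-002 embeddings are essentially unit length, so dot product == cosine similarity
print("Dot Product:", np.dot(embed(messy), embed(fixed)))
```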


That's pretty impressive, ngl. I was messing around with misspellings, and they end up closer to the original than synonyms do.

The funny part is the text you used is part of an internet meme, but I think it has merit (it’s so easy to read).

I actually tried feeding it as-is to turbo-instruct, and it just decided to continue the trend LOL

It's like two babies talking to each other and somehow having a coherent discussion.

It's crazy looking at the tokens as well. Even though only ~28% of the tokens match between the scrambled and unscrambled versions, it still manages to score almost ~90%. The scrambled version is also roughly double the number of tokens.
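The overlap numbers are easy to reproduce with something like this (sketch only; the two strings are placeholders standing in for the meme text):

```python
# Sketch: measure how many token ids the scrambled and unscrambled versions
# share, and compare their token counts. The two strings are placeholders.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

unscrambled = "According to a researcher at Cambridge University ..."
scrambled = "Aoccdrnig to a rscheearch at Cmabrigde Uinervtisy ..."

a, b = enc.encode(unscrambled), enc.encode(scrambled)
shared = set(a) & set(b)

print("tokens (unscrambled):", len(a))
print("tokens (scrambled):  ", len(b))
print("shared token ids:    ", len(shared), f"({len(shared) / len(set(a)):.0%} of unscrambled)")
```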

I asked ChatGPT-4 to "wordspin this (the unscrambled text) while perfectly preserving the context and semantics", embedded the result, and got 93%.

Hypothesis: syntactically incorrect tokens are commonly encountered in misspellings and can be "recovered" because of the common contexts they are found in. Beautifully relatable that they cost double the processing power. The keyword is syntactically, I think, because these tokens aren't found anywhere else besides misspellings.


The most disjointed embeddings are still around 0.60. That's a "D" grade at worst…

compare some stuff:

"My pet cat loves me",
"Mi gato mascota me quiere",  # pet cat in Spanish
"Republica celebration murder",
"embeddings cosine dot product"

0:'My pet cat loves me' <==> 1:'Mi gato mascota me qui':
cos: 0.8930911 dot: 0.8930912
0:'My pet cat loves me' <==> 2:'Republica celebration ':
cos: 0.7155187 dot: 0.7155187
0:'My pet cat loves me' <==> 3:'embeddings cosine dot ':
cos: 0.6884603 dot: 0.6884603
1:'Mi gato mascota me qui' <==> 2:'Republica celebration ':
cos: 0.7176356 dot: 0.7176356
1:'Mi gato mascota me qui' <==> 3:'embeddings cosine dot ':
cos: 0.6615954 dot: 0.6615955
2:'Republica celebration ' <==> 3:'embeddings cosine dot ':
cos: 0.6643258 dot: 0.6643258
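For anyone who wants to reproduce this kind of comparison, a sketch along these lines works (the model name and helper code are my guess at the setup, not the exact script used):

```python
# Sketch: pairwise cosine and dot-product comparison of a few strings, roughly
# reproducing the output above. Model and structure are assumptions.
from itertools import combinations
import numpy as np
from openai import OpenAI

client = OpenAI()

texts = [
    "My pet cat loves me",
    "Mi gato mascota me quiere",  # "my pet cat loves me" in Spanish
    "Republica celebration murder",
    "embeddings cosine dot product",
]

resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
vecs = [np.array(d.embedding) for d in resp.data]

for i, j in combinations(range(len(texts)), 2):
    a, b = vecs[i], vecs[j]
    cos = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    dot = np.dot(a, b)
    print(f"{i}:'{texts[i][:22]}' <==> {j}:'{texts[j][:22]}':")
    print(f"cos: {cos:.7f} dot: {dot:.7f}")
```

Because ada-002 embeddings come back essentially unit-normalized, cosine and dot product agree to about seven decimals, which is exactly what the output above shows.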

whaaat are you even talking abouuut

This is not true, even for the same language (except probably English and some European languages). Usually you create embedding vectors using the ada model, which is cheap and fast, but it is weak at translation and foreign-language understanding. This is from my experience.