It depends on the language. For example, in Arabic, I just published a paper showing that Microsoft's E5 performed best on an Arabic dataset. For other languages it might be different; you can check the MTEB leaderboard on Hugging Face.
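For reference, here is a minimal sketch of trying a multilingual E5 checkpoint via the sentence-transformers library. The model name, example sentences, and the "query:"/"passage:" prefixes are my assumptions based on the public E5 model cards, not something from the paper:

# Minimal sketch: score an Arabic query against candidate passages with a
# multilingual E5 model. Assumes sentence-transformers is installed and the
# intfloat/multilingual-e5-large checkpoint, which expects "query:"/"passage:" prefixes.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("intfloat/multilingual-e5-large")

query = "query: ما هي عاصمة مصر؟"  # "What is the capital of Egypt?"
passages = [
    "passage: القاهرة هي عاصمة جمهورية مصر العربية.",       # "Cairo is the capital of Egypt."
    "passage: الرياض هي عاصمة المملكة العربية السعودية.",    # "Riyadh is the capital of Saudi Arabia."
]

# normalize_embeddings=True makes dot product equal to cosine similarity
q_emb = model.encode(query, normalize_embeddings=True)
p_emb = model.encode(passages, normalize_embeddings=True)
print(util.cos_sim(q_emb, p_emb))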
Yes, I think when we learn, we are creating new relations in our brains … linking speech to words, words to images, images to memories, memories to thoughts, etc. So all this linking is occurring.
This linking in our brain is populating the "last hidden layer," which represents meaning and semantics. But I think it's much more than that too, since we are creating these internal knowledge graphs and associations.
Maybe all these things are considered semantics too? So it's almost like once you get to the point of semantics, you are also at the point of understanding.
In the LLM weights, the semantics are forced via training data, and after training the model has a kind of perceived understanding.
But LLMs aren't understanding things at the level of humans (at least not yet), as they don't have any multimodal cross-linking going on, just statistical pattern matching over the training tokens. And no internal knowledge graph, unless those graph associations are also part of the training.
So this goes back to the whole training-on-reasoning thing, and not just tokens, which basically burns in linkages from token vectors to other token vectors, creating mini knowledge graphs and giving the model more reasoning power.
But these higher-level token mappings are still trained, and aren't occurring naturally, as they seem to in the human brain.
I don't know if this can be solved by introducing more logic-based operations into the LLM. For example, (A implies B) is equivalent to (not B implies not A), so you could do contrastive training where you have a graph edge but also add the corresponding contrapositive edge going in the opposite direction. Maybe?
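To make the contrapositive idea concrete, here is a toy sketch (not a training recipe): for every "A implies B" edge used as a positive pair, also emit the "not B implies not A" edge. The edge representation and the negate() helper are my own illustration.

# Toy illustration: expand implication edges with their contrapositives
# before using them as contrastive training pairs.

def negate(statement: str) -> str:
    # Placeholder negation for illustration only; real data would need proper negation.
    return f"it is not the case that {statement}"

def with_contrapositives(edges):
    """Expand (premise, conclusion) implication edges with their contrapositives."""
    expanded = []
    for premise, conclusion in edges:
        expanded.append((premise, conclusion))                  # A -> B
        expanded.append((negate(conclusion), negate(premise)))  # not B -> not A
    return expanded

edges = [("it is raining", "the ground is wet")]
for premise, conclusion in with_contrapositives(edges):
    print(f"{premise}  =>  {conclusion}")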
Totally. I feel like a different person and even think differently in a different language, which I think is one of the difficulties of learning a language through applications like Duolingo: it teaches us by trying to connect our native language to a foreign one, so instead of speaking naturally we are always trying to translate.
It's for this reason that it's almost necessary to be with natives, and speak with natives, to truly grasp a new language. It needs to be disassociated from your native language and settle into its own unique space.
It could be why it's much easier for a child to learn two languages than for an adult.
Learning a language by association will always be inefficient. It's a great start, don't get me wrong, but there's a certain point of maturity where the bird has to leave the nest and form its own thought.
I'm away from the computer, but I'd like to see how the embedding model manages.
Also, I think when we are giving misspellings to any model, it's worth looking at how the text is tokenized.
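A minimal sketch of doing that, assuming the tiktoken library and the cl100k_base encoding (the one used by GPT-4 and text-embedding-ada-002); the example words are mine:

# Inspect how a misspelled word tokenizes compared with the correct word.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

for word in ["research", "rscheearch"]:
    token_ids = enc.encode(word)
    pieces = [enc.decode([t]) for t in token_ids]
    print(f"{word!r}: {len(token_ids)} tokens -> {pieces}")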
Interesting article:
The embedding model manages decently with a dot product of:
Dot Product: 0.8994434485396415
But GPT-4 was able to correct the errors, as evidenced here, so maybe use GPT-4 to spell check before embedding?
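A sketch of that "spell-check before embedding" idea, assuming the OpenAI Python client (v1+); the prompt wording and model choices (gpt-4 plus text-embedding-ada-002) are assumptions, not a confirmed pipeline:

# Correct the text with GPT-4 first, then embed the corrected version.
from openai import OpenAI

client = OpenAI()

def correct_then_embed(text: str):
    # Ask GPT-4 to fix spelling/scrambling without changing the meaning.
    chat = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Correct the spelling in the user's text. Reply with the corrected text only."},
            {"role": "user", "content": text},
        ],
    )
    corrected = chat.choices[0].message.content

    # Embed the corrected text.
    emb = client.embeddings.create(model="text-embedding-ada-002", input=corrected)
    return corrected, emb.data[0].embedding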
That's pretty impressive ngl. I was messing around with misspellings, and they are closer than synonyms.
The funny part is the text you used is part of an internet meme, but I think it has merit (it's so easy to read).
I actually tried giving it just plainly to turbo-instruct, and it just decided to continue the trend, LOL.
It's like two babies talking to each other and somehow having a coherent discussion.
It's crazy looking at the tokens as well. Even though only ~28% of the tokens match between the scrambled and unscrambled versions, it still manages to score almost ~90%. The scrambled version is also pretty much double the number of tokens.
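A rough sketch of how one might measure that, again assuming tiktoken with cl100k_base; the text snippets are stand-ins for the meme text, and "overlap" here is defined as shared unique token IDs (the thread doesn't say exactly how the 28% was computed):

# Compare token count and unique-token overlap between scrambled and
# unscrambled versions of a text.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

scrambled = "Aoccdrnig to a rscheearch at Cmabrigde Uinervtisy..."
unscrambled = "According to a researcher at Cambridge University..."

s_tokens = enc.encode(scrambled)
u_tokens = enc.encode(unscrambled)

shared = set(s_tokens) & set(u_tokens)
print(f"scrambled tokens:     {len(s_tokens)}")
print(f"unscrambled tokens:   {len(u_tokens)}")
print(f"unique-token overlap: {len(shared) / len(set(s_tokens)):.0%}")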
I asked ChatGPT-4 to "wordspin this (the unscrambled text) while perfectly preserving the context and semantics," embedded it, and got 93%.
Hypothesis: syntactically incorrect tokens are commonly encountered in misspellings and can be "recovered" because of the common contexts they are found in. It's beautiful, and somehow relatable, that they cost double the processing power. The keyword is syntactically, I think, because these tokens aren't found anywhere else besides misspellings.
The most disjointed embeddings are still around 0.60. That's a "D" grade at worst…
compare some stuff:
"My pet cat loves me",
"Mi gato mascota me quiere",  # pet cat in Spanish
"Republica celebration murder",
"embeddings cosine dot product"
0:'My pet cat loves me' <==> 1:'Mi gato mascota me qui':
cos: 0.8930911 dot: 0.8930912
0:'My pet cat loves me' <==> 2:'Republica celebration ':
cos: 0.7155187 dot: 0.7155187
0:'My pet cat loves me' <==> 3:'embeddings cosine dot ':
cos: 0.6884603 dot: 0.6884603
1:'Mi gato mascota me qui' <==> 2:'Republica celebration ':
cos: 0.7176356 dot: 0.7176356
1:'Mi gato mascota me qui' <==> 3:'embeddings cosine dot ':
cos: 0.6615954 dot: 0.6615955
2:'Republica celebration ' <==> 3:'embeddings cosine dot ':
cos: 0.6643258 dot: 0.6643258
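For anyone who wants to reproduce output in this shape, here is a sketch of how it could be generated. I'm assuming the OpenAI Python client (v1+) with text-embedding-ada-002, whose vectors are unit-normalized, which is why cosine and dot product agree to the sixth decimal above; the 22-character truncation in the printout is also my guess.

# Pairwise cosine and dot-product comparison of a few short texts.
import numpy as np
from openai import OpenAI

client = OpenAI()

texts = [
    "My pet cat loves me",
    "Mi gato mascota me quiere",   # pet cat in Spanish
    "Republica celebration murder",
    "embeddings cosine dot product",
]

resp = client.embeddings.create(model="text-embedding-ada-002", input=texts)
vecs = [np.array(d.embedding) for d in resp.data]

for i in range(len(texts)):
    for j in range(i + 1, len(texts)):
        dot = vecs[i] @ vecs[j]
        cos = dot / (np.linalg.norm(vecs[i]) * np.linalg.norm(vecs[j]))
        print(f"{i}:{texts[i][:22]!r} <==> {j}:{texts[j][:22]!r}:")
        print(f"cos: {cos:.7f} dot: {dot:.7f}")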
This is not true, even for the same language, except probably English and some European languages. Usually you create embedding vectors using the ada model, which is cheap and fast, but it is weak at translations and foreign-language understanding. This is from my experience.


