Universal lemmatization

juliushamilton100 · November 18, 2021, 3:17pm

This is a pretty ambitious idea and I would like to put it out there in case anybody has worked on this.

Is it possible that GPT-3 could be a near-universal lemmatization engine?

That is, provided any form / inflection of a word in a wide variety of common languages, could it provide the root or dictionary form?

dogs → dog
hablando → hablar
食べた → 食べる

I’ll be exploring this soon, but in case anybody knows about applications like this, or people who have worked on it or are experts in this, I’d love to hear more.

Thanks.

daveshapautomator · November 18, 2021, 3:30pm

This would be a perfect use-case for fine-tuning. That being said, why use something as powerful as GPT-3 when WordNet and NLTK can do this? What are the gaps you’re trying to cover?

juliushamilton100 · November 18, 2021, 4:49pm

From everything I’ve seen online so far, WordNet is only for English, and NLTK provides a good lemmatization method, but I don’t think it’s multilingual. Did you have a multilingual use of NLTK in mind?

(A lot of hits have come up in Google Scholar and Google for “GPT-3 lemmatization” and “multi-lingual lemmatization” so I think I’ll find some good reading material on this.)

daveshapautomator · November 18, 2021, 4:51pm

Oh I see, you mean universal as in all languages. In that case, no. GPT-3 is trained mostly on English with only a few other languages leaking in.

juliushamilton100 · November 18, 2021, 5:25pm

But maybe if there’s a website where good quality datasets can be found for a wide variety of languages - or if there’s a single ubiquitously usable web crawler to generate a language-specific dataset for any language - then a system like GPT-3, maybe BERT, could be trained on each language’s dataset, and you could have a multi-lingual lemmatizer pretty easily. What do you think?

daveshapautomator · November 18, 2021, 5:36pm

You could try a fine-tuning set for multiple languages. Certainly you can probably hit the top 10 or 50 languages in the world but again, if the model hasn’t seen most of the vocabulary in a language then it will just be guessing.
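A minimal sketch of what such a multi-language fine-tuning set might look like, assuming the JSONL prompt/completion layout used for GPT-3 fine-tuning at the time. The three training pairs are the examples from the first post; the prompt wording and the helper names `to_record` and `to_jsonl` are hypothetical, not taken from any library.

```python
import json

# Illustrative triples only: (language, inflected form, lemma).
# A real set would need many examples per language.
EXAMPLES = [
    ("English", "dogs", "dog"),
    ("Spanish", "hablando", "hablar"),
    ("Japanese", "食べた", "食べる"),
]


def to_record(language, form, lemma):
    """Build one prompt/completion pair (hypothetical prompt wording)."""
    return {
        "prompt": f"Language: {language}\nWord: {form}\nLemma:",
        # Leading space and trailing newline give the model a clear
        # start and stop for the answer.
        "completion": f" {lemma}\n",
    }


def to_jsonl(examples):
    """Serialize all examples as JSON Lines, one record per line."""
    return "\n".join(
        json.dumps(to_record(*ex), ensure_ascii=False) for ex in examples
    )


if __name__ == "__main__":
    print(to_jsonl(EXAMPLES))
```

Naming the language in the prompt is a design choice rather than a requirement; it may help with forms that are spelled the same in more than one language.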

juliushamilton100 · November 18, 2021, 7:08pm

I posted my question about easily getting data for any language here, if you’re interested: “Is there a ubiquitous web crawler that can generate a good language-specific dataset for training a transformer?” (Data Science Stack Exchange)

But GPT-3 is trained on Common Crawl, and their website says 40+ languages. Is Common Crawl English-centric? I thought I’d seen elsewhere GPT-3 can translate between many languages well.

Are you sure GPT-3 really needs additional data for other languages?

Thanks very much
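One way to test whether extra data is needed for a given language is to score a lemmatizer per language on a small gold list. The sketch below is only an outline under assumptions: `accuracy_by_language` and the toy baseline `strip_s` are hypothetical names, the gold list is just the three examples from the first post, and a real check would pass in a function that calls the model instead of the baseline.

```python
from collections import defaultdict

# Tiny gold list: (language code, inflected form, expected lemma).
GOLD = [
    ("en", "dogs", "dog"),
    ("es", "hablando", "hablar"),
    ("ja", "食べた", "食べる"),
]


def accuracy_by_language(lemmatize, gold):
    """Return the fraction of exact-match lemmas for each language."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for lang, form, lemma in gold:
        total[lang] += 1
        if lemmatize(form, lang) == lemma:
            correct[lang] += 1
    return {lang: correct[lang] / total[lang] for lang in total}


def strip_s(form, lang):
    """Toy baseline standing in for a model call: drop a final English 's'."""
    if lang == "en" and form.endswith("s"):
        return form[:-1]
    return form


if __name__ == "__main__":
    print(accuracy_by_language(strip_s, GOLD))
```

Comparing the per-language scores of a model against a trivial baseline like this one would show which languages it has actually learned and which it is guessing at.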

daveshapautomator · November 18, 2021, 7:28pm

What I’m remembering is that >90% of the total volume of training data was English. So yes, it may have seen other languages, but I would not count on it having a solid grasp of them. If that number is wrong then yeah, maybe you can train a lemmatizer for the languages it has seen enough of. But if it hasn’t seen a language at all, I suspect it won’t work.

Oh but also there’s no reason not to try it. I would say use DAVINCI INSTRUCT and just tell it to lemmatize your words.
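A rough sketch of the "just tell it to lemmatize" approach, with no API call included. It only builds a few-shot prompt from the three examples in the first post and pulls the lemma out of whatever text the model returns. The instruction wording and the helper names `build_prompt` and `parse_lemma` are assumptions for illustration.

```python
# Few-shot pairs taken from the examples earlier in the thread.
FEW_SHOT = [
    ("dogs", "dog"),
    ("hablando", "hablar"),
    ("食べた", "食べる"),
]


def build_prompt(word):
    """Build an instruction plus few-shot examples, ending at the open slot."""
    lines = ["Give the dictionary form (lemma) of each word.", ""]
    for form, lemma in FEW_SHOT:
        lines.append(f"{form} -> {lemma}")
    lines.append(f"{word} ->")
    return "\n".join(lines)


def parse_lemma(completion):
    """Take the first non-empty line of the model's reply as the lemma."""
    for line in completion.splitlines():
        if line.strip():
            return line.strip()
    return ""


if __name__ == "__main__":
    # The printed prompt is what would be sent to the instruct model.
    print(build_prompt("corriendo"))
```

Keeping only the first non-empty line guards against the model continuing the pattern with extra invented pairs.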

DR.AMES · November 18, 2021, 7:35pm

What about using a Vision wash?

juliushamilton100 · November 18, 2021, 8:47pm

Sorry, what is that? I’ve never heard of it before.

DR.AMES · November 18, 2021, 9:17pm

Using a Vision API to extract the textual data’s (i.e., pdfs) algorithms. I used it with Oleo (Hawai’ian). The Vision API may be used to train a language model (Luca Pacioli, Divina Proportione). I’m not sure how Da Vinci got most of the credit here. See Euclid too. I like using Φ to validate my models for its symmetry. I suppose π would work too but this may add complexity. I think this is equal to Φ^2.

Ni · December 28, 2022, 11:50pm

Just based on trying out ChatGPT, I suspect that lemmatization could work even ten times better. Not that I’m an expert in this field, but I’m more than surprised by how well a rather small language like Finnish already works with the ChatGPT demo.
