Universal lemmatization

This is a pretty ambitious idea and I would like to put it out there in case anybody has worked on this.

Is it possible that GPT-3 could be a near-universal lemmatization engine?

That is, given any inflected form of a word in a wide variety of common languages, could it return the root or dictionary form?

dogs → dog
hablando → hablar
食べた → 食べる

I’ll be exploring this soon, but in case anybody knows about applications like this, or people who have worked on it or are experts in this, I’d love to hear more.


This would be a perfect use-case for fine-tuning. That being said, why use something as powerful as GPT-3 when WordNet and NLTK can do this? What are the gaps you’re trying to cover?

From everything I’ve seen online so far, WordNet is English-only, and NLTK provides a good lemmatization method, but I don’t think it’s multi-lingual. Did you have a multi-lingual use of NLTK in mind?

(A lot of hits have come up in Google Scholar and Google for “GPT-3 lemmatization” and “multi-lingual lemmatization” so I think I’ll find some good reading material on this.)

Oh I see, you mean universal as in all languages. In that case, no. GPT-3 is trained mostly on English with only a few other languages leaking in.

But maybe if there’s a website with good-quality datasets for a wide variety of languages, or a single web crawler that can generate a language-specific dataset for any language, then a system like GPT-3 (or maybe BERT) could be trained on each language’s dataset, and you could have a multi-lingual lemmatizer pretty easily. What do you think?

You could try a fine-tuning set for multiple languages. You can probably cover the top 10 or 50 languages in the world, but again, if the model hasn’t seen most of the vocabulary in a language, it will just be guessing.
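
For what it’s worth, here is a rough sketch of what such a multi-language fine-tuning set might look like. It assumes a JSON Lines file of prompt/completion records; the field layout, the `Language:`/`Word:`/`Lemma:` prompt template, and the helper names `to_record` and `to_jsonl` are all my own assumptions for illustration, and the only pairs shown are the three examples from the first post. A real set would need many pairs per language.

```python
import json

# Hypothetical training pairs: (language code, inflected form, lemma).
# These three come from the examples in the original question.
PAIRS = [
    ("en", "dogs", "dog"),
    ("es", "hablando", "hablar"),
    ("ja", "食べた", "食べる"),
]


def to_record(lang, form, lemma):
    """Build one prompt/completion record (the template is an assumption)."""
    return {
        "prompt": f"Language: {lang}\nWord: {form}\nLemma:",
        # Leading space and trailing newline give the model a clear stop point.
        "completion": f" {lemma}\n",
    }


def to_jsonl(pairs):
    """Serialize the pairs as JSON Lines, one record per line."""
    return "\n".join(
        json.dumps(to_record(*pair), ensure_ascii=False) for pair in pairs
    )


if __name__ == "__main__":
    print(to_jsonl(PAIRS))
```

Tagging each prompt with the language is a design guess: it may help with forms that are spelled the same in two languages but have different lemmas.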

I posted my question about easily getting data for any language here, if you’re interested: “Is there a ubiquitous web crawler that can generate a good language-specific dataset for training a transformer?” (Data Science Stack Exchange)

But GPT-3 is trained on Common Crawl, and the Common Crawl website says 40+ languages. Is Common Crawl English-centric? I thought I’d seen elsewhere that GPT-3 can translate between many languages well.

Are you sure GPT-3 really needs additional data for other languages?

Thanks very much

What I’m remembering is that >90% of the total volume of training data was English. So yes, it may have seen other languages, but I would not count on it having a solid grasp of them. If that number is wrong, then yeah, maybe you can train a lemmatizer for the languages it has seen enough of. But if it hasn’t seen a language at all, I suspect it won’t work.
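
One way to check which languages it has “seen enough of” would be to score the lemmatizer per language on a small gold list. The sketch below is only that, a sketch: `accuracy_by_language` is a hypothetical helper, `strip_s` is a toy stand-in for whatever model you actually call, and the gold list is just the three examples from the first post.

```python
from collections import defaultdict


def accuracy_by_language(lemmatize, gold):
    """Return {language: fraction of forms lemmatized correctly}.

    `lemmatize` is any callable mapping a word form to a lemma;
    `gold` is an iterable of (language, form, expected_lemma) triples.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for lang, form, lemma in gold:
        total[lang] += 1
        if lemmatize(form) == lemma:
            correct[lang] += 1
    return {lang: correct[lang] / total[lang] for lang in total}


def strip_s(word):
    """Toy stand-in lemmatizer: drops a trailing 's' and nothing else."""
    return word[:-1] if word.endswith("s") else word


GOLD = [
    ("en", "dogs", "dog"),
    ("es", "hablando", "hablar"),
    ("ja", "食べた", "食べる"),
]

if __name__ == "__main__":
    print(accuracy_by_language(strip_s, GOLD))
```

With the toy stand-in, only the English pair comes out right, which is the kind of per-language gap this breakdown is meant to expose.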

Oh, but also, there’s no reason not to try it. I would say use Davinci Instruct and just tell it to lemmatize your words.
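
As a starting point, the “just tell it” approach could be as simple as building a prompt like the one below and sending it to the model. This only constructs the prompt text and parses a reply; it makes no API call. The instruction wording, the `Word:`/`Lemma:` layout, and the helper names `build_prompt` and `parse_lemma` are assumptions for illustration, not a tested recipe.

```python
# Few-shot examples taken from the original question.
FEW_SHOT = [("dogs", "dog"), ("hablando", "hablar"), ("食べた", "食べる")]


def build_prompt(word, examples=FEW_SHOT):
    """Build an instruction plus few-shot prompt ending at 'Lemma:'."""
    lines = ["Give the dictionary form (lemma) of each word.", ""]
    for form, lemma in examples:
        lines.append(f"Word: {form}")
        lines.append(f"Lemma: {lemma}")
        lines.append("")
    lines.append(f"Word: {word}")
    lines.append("Lemma:")
    return "\n".join(lines)


def parse_lemma(completion):
    """Take the first non-empty line of the model's reply as the lemma."""
    for line in completion.splitlines():
        if line.strip():
            return line.strip()
    return ""


if __name__ == "__main__":
    print(build_prompt("mangeons"))
```

Keeping only the first non-empty line of the reply guards against the model continuing with extra `Word:`/`Lemma:` pairs of its own.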

What about using a Vision wash?

Sorry, what is that? I’ve never heard of it before.

Using a Vision API to extract the textual data’s (i.e., pdfs) algorithms. I used it with Oleo (Hawai’ian). The Vision API may be used to train a language model (Luca Pacioli, Divina Proportione). I’m not sure how Da Vinci got most of the credit here. See Euclid too. I like using Φ to validate my models for its symmetry. I suppose π would work too but this may add complexity. I think this is equal to Φ^2.