Is there a list somewhere of the human languages supported by text-embedding-ada-002?
This article, Revolutionizing Natural Language Processing: OpenAI’s ADA-002 Model Takes the Stage | by Jen Codes | Medium, says:
“It has been trained on a diverse set of languages, including English, Spanish, French, and Chinese, and has shown impressive results in tasks such as cross-lingual transfer learning.”
But I have so far been unable to find a list anywhere.
I don’t think there is a definitive list, mainly because that would presuppose a definitive list of the datasets it was trained on. In practice it supports whatever languages are in the training set, which would typically include the languages most commonly used online.
The entire field of NLP is new; we are the ones writing the textbooks, the quick guides and the lists. This could be a great side project for someone: a linguistic embedding performance evaluation by language.
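If anyone wants to try that side project, here is a rough sketch of what a per-language check could look like. It is only one possible approach, not an established benchmark: for each language you take a few English sentences with their translations, embed both sides, and see whether each English sentence lands closest to its own translation. Everything named here (`score_language`, `toy_embed`, the sample sentences and their vectors) is made up for illustration. The `embed` argument is a placeholder for whatever function you use to get a vector back from text-embedding-ada-002; the toy version below just returns hand-written vectors so the script runs without an API key.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def score_language(
    embed: Callable[[str], Vector],
    pairs: List[Tuple[str, str]],
) -> Dict[str, float]:
    """Score one language from (english, translation) sentence pairs.

    Returns the mean similarity of true pairs, the mean similarity of
    mismatched pairs, and the fraction of English sentences whose nearest
    translation is the correct one (top-1 accuracy).
    """
    en_vecs = [embed(en) for en, _ in pairs]
    tr_vecs = [embed(tr) for _, tr in pairs]

    matched = [cosine(e, t) for e, t in zip(en_vecs, tr_vecs)]
    mismatched = [
        cosine(e, t)
        for i, e in enumerate(en_vecs)
        for j, t in enumerate(tr_vecs)
        if i != j
    ]

    hits = 0
    for i, e in enumerate(en_vecs):
        # Index of the translation most similar to this English sentence.
        best = max(range(len(tr_vecs)), key=lambda j: cosine(e, tr_vecs[j]))
        if best == i:
            hits += 1

    return {
        "matched": sum(matched) / len(matched),
        "mismatched": sum(mismatched) / len(mismatched) if mismatched else 0.0,
        "top1": hits / len(pairs),
    }


# Toy stand-in for a real embedding call (ASSUMPTION: hand-written vectors,
# not real model output). Swap in your own function that returns the
# embedding for a string.
_TOY_VECTORS = {
    "the cat sleeps": [1.0, 0.1],
    "el gato duerme": [1.0, 0.1],
    "it is raining": [0.1, 1.0],
    "está lloviendo": [0.1, 1.0],
}


def toy_embed(text: str) -> Vector:
    return _TOY_VECTORS[text]


pairs_es = [
    ("the cat sleeps", "el gato duerme"),
    ("it is raining", "está lloviendo"),
]
result = score_language(toy_embed, pairs_es)
print(result)
```

With real embeddings you would run this once per language on the same set of English sentences and compare the numbers. A language where "matched" is barely above "mismatched", or where "top1" drops, is probably one the model handles less well. A handful of sentences will not tell you much, so treat small samples as a smoke test rather than a verdict.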
I did find this, so I’m rolling with it for now: List of languages supported by ChatGPT | Botpress Blog
I have tested Spanish, Chinese and Korean and gotten good results. By good, I mean the results you would expect from a machine translating prompts from English (or non-English) to English (for vectorizing) and then interpreting the resulting documents in order to translate them back into the non-English language. That would be tough for an experienced translator, let alone a machine that has little sense of grammar, semantics, phraseology, idioms, etc.
I can add Portuguese, French, Spanish, Italian, Mandarin and German. I have tested all of these and am using them live with embeddings.