Creating a Dataset for Translating Indigenous Languages in Latin America

Hello,

I am looking for people to contribute to our non-profit project that will initially serve as a reference for future translations of indigenous languages in Latin America.

The project is called Yanomai. I am working on it on my own and am currently processing the dataset: filtering the collected data and preparing it for use.

If you are interested, reply or message me privately.

Hi. I'm interested.

Can you give me more details?

Hello @amaotox

I would love to share all the details of the project and its current state publicly, but not right now; maybe in late December.

So I will send you the details by private message, if that is all right?

Thank you very much. All help is very welcome, especially for the hundreds of other indigenous languages of Latin America. Three months ago I had the idea of bringing something to my country, Brazil, in parallel with what we are developing commercially, which is content created with GPT-3. I wanted to try connecting Indigenous people with AI, and that was when I discovered that GPT-3 didn’t understand Tupi-Guarani, one of the main indigenous languages spoken in Brazil and in several other countries of the continent.

Hugs!

Just a reminder:

http://www.etnolinguistica.org/biblio:carvalho-1987-dicionario

We are using this dictionary as our model. I am posting this to show that this is not a top-secret Martian project, and that I was very careful to start this part by preserving as many records as I could.

My difficulty now is restoring the strokes and signs in the typography (accents and other special marks), even though the book has been scanned with OCR (the administrator responsible for maintaining the repository and library has scanned dozens of books and used OCR on other books as well).

I was able to use nltk, pandas and numpy in some tests, but I have not yet quantified how many terms I will lose if I choose a template that excludes the words or letters without an apparent encoding.
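One way that loss could be quantified is sketched below. This is not the project's actual code. It assumes that "without apparent encoding" means the Unicode replacement character or a control, private-use, unassigned, or surrogate code point, and the names `is_suspect`, `count_losses` and `sample` are hypothetical. The sample strings are illustrative and were not taken from the dictionary.

```python
import unicodedata

# Unicode general categories treated as "no apparent encoding" (an assumption):
# Cc = control, Co = private use, Cn = unassigned, Cs = surrogate.
SUSPECT_CATEGORIES = {"Cc", "Co", "Cn", "Cs"}
REPLACEMENT_CHAR = "\ufffd"  # emitted by decoders for bytes they cannot decode


def is_suspect(term: str) -> bool:
    """Return True if the term contains a character that looks like OCR or encoding damage."""
    # NFC merges base letters with combining accents, so "a" + U+0301 becomes "á".
    normalized = unicodedata.normalize("NFC", term)
    return any(
        ch == REPLACEMENT_CHAR or unicodedata.category(ch) in SUSPECT_CATEGORIES
        for ch in normalized
    )


def count_losses(terms):
    """Return (total, lost, lost_terms) for a list of dictionary entries."""
    lost_terms = [t for t in terms if is_suspect(t)]
    return len(terms), len(lost_terms), lost_terms


# Illustrative entries only; two of them carry simulated damage.
sample = ["abá", "ka\ufffda", "pirá", "y\x00by", "tatá"]
total, lost, lost_terms = count_losses(sample)
print(f"{lost} of {total} terms would be dropped ({lost / total:.0%})")
```

Accented letters such as "á" pass this check, so only entries with genuinely undecodable characters are counted as lost. The same predicate could be applied to a pandas column with `Series.map(is_suspect)` if the entries are already in a DataFrame.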

@amaotox

Note: The code was written from scratch with the help of Copilot.

Just an update:

1 - Incluindo Línguas Indígenas nos Modelos GPT-3 e GPT-4: Conheça o Projeto Yanomai - Parte 1 (Including Indigenous Languages in the GPT-3 and GPT-4 Models: Meet the Yanomai Project - Part 1) - YouTube

I created 3 videos talking about Yanomai =D