Creating a Dataset for Translating for Indigenous Languages in Latin America

joseicarobc · November 11, 2022, 4:25am

Hello,

I am looking for people to contribute to our non-profit project that will initially serve as a reference for future translations of indigenous languages in Latin America.

The project is called Yanomai. I am working on it alone and am currently processing the dataset: filtering the collected data and preparing it for use.

If you are interested, reply here or message me privately.

amaotox · November 12, 2022, 4:36am

Hi. I'm interested.

Can you give me more details?

joseicarobc · November 12, 2022, 6:19am

Hello @amaotox

I would love to share all the details of the project and its current state publicly, but not right now; maybe in late December.

In the meantime, I will send you the details by private message, if that's all right?

Thank you very much. All help will always be very welcome, especially for the hundreds of other Indigenous languages of Latin America. Three months ago I had the idea of bringing something to the scene in my country, Brazil, in parallel with what we are developing commercially, which is content created with GPT-3. I came up with the idea of trying to connect Indigenous people with AI, and that was when I discovered that GPT-3 didn’t understand Tupi-Guarani, one of the main Indigenous languages spoken in Brazil and in several other countries of the continent.

Hugs!

joseicarobc · November 12, 2022, 10:04am

Just a reminder:

http://www.etnolinguistica.org/biblio:carvalho-1987-dicionario

We are using this dictionary as our model. I am replying to show that this is not a top-secret Martian project, and that I took great care to start this stage by preserving as many records as I could.

My difficulty now is correcting the strokes and marks (accents and other signs) in the typography, even though the book has been scanned with OCR (the administrator responsible for maintaining the repository and library has scanned dozens of books and used OCR on other books as well).

I was able to use nltk, pandas and numpy in some tests, but I have not yet quantified how many terms I would lose if I chose an approach that drops the words or letters with no apparent encoding.
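As a rough illustration of the kind of cleanup this involves (a sketch only, not the project's actual code): OCR sometimes emits an accent as a stand-alone character next to the letter instead of a single accented letter. Python's standard unicodedata module can recompose such pairs. The SPACING_TO_COMBINING mapping and the function name repair_diacritics are hypothetical, and the real scans may be damaged in other ways.

```python
import unicodedata

# Hypothetical mapping: stand-alone accents that OCR may emit after a letter,
# mapped to the equivalent Unicode combining mark.
SPACING_TO_COMBINING = {
    "\u00b4": "\u0301",  # acute accent      -> combining acute
    "\u0060": "\u0300",  # grave accent      -> combining grave
    "\u02dc": "\u0303",  # small tilde       -> combining tilde
    "\u005e": "\u0302",  # circumflex accent -> combining circumflex
}


def repair_diacritics(text: str) -> str:
    """Attach stray accents to the preceding letter, then recompose (NFC)."""
    out = []
    for ch in text:
        # Only convert an accent when it directly follows a letter.
        if ch in SPACING_TO_COMBINING and out and out[-1].isalpha():
            out.append(SPACING_TO_COMBINING[ch])
        else:
            out.append(ch)
    # NFC merges letter + combining mark into one code point where one exists.
    return unicodedata.normalize("NFC", "".join(out))
```

Accents that do not follow a letter are left untouched, so they can still be reviewed by hand.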
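One way to put a number on that loss (a minimal sketch using only the standard library; the ALLOWED alphabet and the function names are assumptions, not the project's real configuration) is to flag every term containing a character outside the expected alphabet, such as the Unicode replacement character, and report the share that would be dropped.

```python
import unicodedata

# Hypothetical alphabet: ASCII letters plus accented letters expected in the
# dictionary, and a few separators. Adjust to the actual orthography.
ALLOWED = set(
    unicodedata.normalize("NFC", "abcdefghijklmnopqrstuvwxyzáàâãéêẽíĩóôõúũüçñỹ'- ")
)


def is_clean(term: str) -> bool:
    """True if every character of the term is in the expected alphabet."""
    normalized = unicodedata.normalize("NFC", term).lower()
    return all(ch in ALLOWED for ch in normalized)


def loss_report(terms):
    """Count how many terms would be lost by dropping the unclean ones."""
    dropped = [t for t in terms if not is_clean(t)]
    total = len(terms)
    return {
        "total": total,
        "dropped": len(dropped),
        "share_dropped": len(dropped) / total if total else 0.0,
        "examples": dropped[:5],  # a few samples for manual inspection
    }
```

Calling loss_report on the list of headwords gives the count before any filtering is committed to, and the examples show which characters cause the most drops.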

@amaotox

Note: The code was written from scratch with the help of Copilot.

joseicarobc · April 20, 2023, 5:18pm

Just an update:

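1 - Incluindo Línguas Indígenas nos Modelos GPT-3 e GPT-4: Conheça o Projeto Yanomai - Parte 1 - YouTube (in English: "Including Indigenous Languages in the GPT-3 and GPT-4 Models: Meet the Yanomai Project - Part 1")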

I created 3 videos talking about Yanomai =D

Shadloom · May 19, 2025, 7:09pm

Hello! We are an overseas practice team from Tsinghua University, China. We will be conducting fieldwork on the digital preservation of Indigenous languages in Brazil during early August 2025, focusing on digital documentation of Indigenous languages, community-driven language data collection, or AI-assisted speech recognition, which are highly relevant to your work. Could you kindly let me know the current status of this project, and whether you would be willing to engage in further communication with our team?

Shobha_Ramani · July 9, 2025, 12:43pm

Dear Jose
We were delighted when we came across your drive to create a dataset for Indigenous languages in Latin America.

I represent Hecho Por Nosotros (HXN), an NGO based in Buenos Aires which is exploring a similar project.

I would also like to take this opportunity to invite you to join us for the online side event hosted by HXN at the UN High-Level Political Forum 2025 on Jul 14, 2025.

The event focuses on how AI can help artisanal communities. It would be great if you could join Nelly P Garcia-Lopez, Assistant Professor, Universidad de los Andes, and a few more experts in an interactive breakout session where we are addressing AI solutions to preserve indigenous wisdom, including traditional creative art forms.

Apologies for the short notice, but I just found your links through a Google search (this is kind of a shout into the void!). Please do share an email address where I can send more details.

Looking forward to hearing from you
Thank you
Shobha Ramani
shobha.ramani.hxn@gmail.com
