Hello fellow language enthusiasts and AI experts,
I am currently working on a project to fine-tune a GPT model for direct translation of Mapudungun, an indigenous language spoken by the Mapuche people in Chile and Argentina. I am seeking advice on the best way to create a dataset for fine-tuning and on whether the available resources are sufficient for training the model.
I came across a dictionary available at the following URL, which provides translations between Mapudungun and Spanish:
My questions are:
- Is this dictionary sufficient to create a dataset for fine-tuning the GPT model, or do I need additional resources?
- What is the best approach to creating a high-quality dataset for training the model, considering the limited resources available for this language? (I have sketched my current idea below this list.)
- Has anyone here worked on a similar project, or seen examples of fine-tuning GPT for direct translation of lesser-known or low-resource indigenous languages? If so, please share your experiences and insights.
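To make the second question concrete, here is a rough sketch of how I currently imagine converting the dictionary entries into a fine-tuning dataset in the chat-style JSONL format used by OpenAI's fine-tuning endpoint. The file name, column layout, and prompt wording are placeholders I made up, not a tested pipeline:

```python
# Rough sketch: turn dictionary entries into a JSONL fine-tuning file.
# Assumes a TSV file "mapudungun_spanish.tsv" with two columns
# (Mapudungun term, Spanish translation) -- adjust path/columns to your data.
import csv
import json

SYSTEM_PROMPT = "Translate the following Mapudungun text into Spanish."

with open("mapudungun_spanish.tsv", newline="", encoding="utf-8") as src, \
     open("train.jsonl", "w", encoding="utf-8") as out:
    reader = csv.reader(src, delimiter="\t")
    for row in reader:
        if len(row) < 2:
            continue  # skip malformed or empty rows
        mapudungun, spanish = row[0].strip(), row[1].strip()
        example = {
            "messages": [
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": mapudungun},
                {"role": "assistant", "content": spanish},
            ]
        }
        out.write(json.dumps(example, ensure_ascii=False) + "\n")
```

My concern with this approach is that isolated word pairs from a dictionary may not teach the model sentence-level grammar, which is partly why I am asking whether additional parallel text is needed.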
I appreciate any suggestions or guidance on this topic, as I believe that preserving and promoting the use of indigenous languages is of great cultural importance. I look forward to hearing from you and learning from your expertise.
Thank you in advance!