Best practices for a unique translation task: Old Hawaiian Text to Modern Hawaiian Text

I’ve been working on a side project for a while to create a model that can take text written in the Hawaiian language in the 1900s and convert it into a form modern readers can read (with all of the appropriate punctuation). It’s not trivial, because adding the punctuation can change the meaning of the words, and many words have multiple possible representations. It’s essentially a machine translation task, but between two very similar languages. I’ve had some success training my own model, but I’ve hit a wall, and I was thinking I could use ChatGPT to perform the task for the users.

I’ve tried fine-tuning with a largish data set (10,000+ sentence examples), but the results weren’t great, and I’ve tried prompting GPT-4, which gets close but still makes quite a few mistakes. So my question is: should I continue to try fine-tuning, try more/different prompts, or explore other tactics like embeddings?

Would love any advice, thanks.


First of all, welcome to the forum!

You’re in the right place; I’m sure someone here will be able to help you start getting better results.

I do not have any experience with translation tasks, but I might be able to give you some general guidance with fine-tuning.

If you have 10,000+ sentence pairs, then you likely have more than enough data to train on.

With respect to your training data, it’s critically important that the structure of the prompt:response pairs in the training set match that of how the fine-tuned model will be used.

So if your training data looks like this:

Prompt: Translate this sentence from Old Hawaiian to Modern Hawaiian:
Xxxxxxxxxx xxx xxxxx xxxxxx xxxx
###
Response: Yyyyyyy yyyy Yyyyyyy yyyyy
###

Then that is exactly the structure you should use when calling the fine-tuned model.
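For concreteness, here is a minimal sketch of how you might turn your sentence pairs into a training file. It assumes the legacy prompt/completion JSONL fine-tuning format with "###" as the separator and stop sequence; if you fine-tune a chat model, the file uses a list of messages instead, but the same match-training-to-inference rule applies. The file name, helper function, and prompt wording are made up for illustration.

```python
# Sketch: convert (old, modern) sentence pairs into prompt/completion JSONL
# for the legacy fine-tuning format. The separator and stop sequence ("###")
# must match exactly what you send at inference time.
import json

PROMPT_TEMPLATE = (
    "Translate this sentence from Old Hawaiian to Modern Hawaiian:\n{old}\n###\n"
)

def write_training_file(pairs, path="hawaiian_train.jsonl"):
    """pairs is an iterable of (old_hawaiian, modern_hawaiian) strings."""
    with open(path, "w", encoding="utf-8") as f:
        for old, modern in pairs:
            record = {
                "prompt": PROMPT_TEMPLATE.format(old=old),
                # Leading space and trailing "###" follow the conventions of the
                # completions-style fine-tuning format.
                "completion": " " + modern + "\n###",
            }
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

At inference time you would then send exactly `PROMPT_TEMPLATE.format(old=...)` as the prompt and use "###" as the stop sequence.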

Here, I would like to note that fine-tuning is generally used to teach the model how to behave; it’s not a good, efficient, or reliable method for imparting new knowledge for the model to use.

The problem you’ll be facing is that (I’m assuming) both Old Hawaiian and Modern Hawaiian are “low resource” languages, so the models will have very limited knowledge of them from their training data.

You can use embeddings to provide the model with new information. Basically the goal is to find relevant information and add it to the context window for the model to reference.

I don’t know how effective this would be for translation tasks, but it might be worth a try.
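One concrete way to apply it here would be to embed the Old Hawaiian side of your 10,000+ sentence pairs once, then for each new sentence retrieve the most similar pairs and include them as few-shot examples in the prompt. The sketch below assumes the openai Python SDK (1.x) and an embedding model name that you would swap for whatever you use; the helper functions and prompt wording are illustrative, not a tested recipe.

```python
# Sketch: retrieval-augmented few-shot prompting with embeddings.
import numpy as np
from openai import OpenAI

client = OpenAI()

def embed(texts):
    # For a 10k-sentence corpus you would batch these calls and cache the vectors.
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

def nearest_examples(query, corpus_pairs, corpus_vectors, k=5):
    """corpus_pairs: list of (old, modern) pairs; corpus_vectors: embed() of the old sentences."""
    q = embed([query])[0]
    sims = corpus_vectors @ q / (np.linalg.norm(corpus_vectors, axis=1) * np.linalg.norm(q))
    return [corpus_pairs[i] for i in np.argsort(-sims)[:k]]

def build_prompt(query, examples):
    shots = "\n\n".join(f"Old: {o}\nModern: {m}" for o, m in examples)
    return (
        "Convert Old Hawaiian text to Modern Hawaiian, adding the appropriate "
        "modern punctuation.\n\n" + shots + f"\n\nOld: {query}\nModern:"
    )
```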

Something else you might consider is using a high-resource intermediary language, likely English. My thinking here is that you might have an easier time going from Old Hawaiian to English and then from English to Modern Hawaiian, as there may have been more examples of those translations in the training data than of direct Old → Modern.
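If you want to try that, a rough sketch of the two-hop version with chat completions might look like the following; the model name, prompt wording, and the choice of English as the pivot are assumptions to validate rather than a recommendation.

```python
# Sketch: pivot translation, Old Hawaiian -> English -> Modern Hawaiian.
# Assumes the openai Python SDK (1.x); "gpt-4" and the prompts are illustrative.
from openai import OpenAI

client = OpenAI()

def _translate(text, source, target):
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {
                "role": "system",
                "content": f"Translate the user's text from {source} to {target}. "
                           "Reply with the translation only.",
            },
            {"role": "user", "content": text},
        ],
    )
    return resp.choices[0].message.content.strip()

def old_to_modern_via_english(old_text):
    english = _translate(old_text, "Old Hawaiian", "English")
    return _translate(english, "English", "Modern Hawaiian (with modern punctuation)")
```

The obvious risk is that meaning lost in the first hop can’t be recovered in the second, so it would be worth comparing this against direct Old → Modern prompting on a held-out set of your pairs.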


Hey there @alex.petrescu! I suggest visiting this link: Welcome to Spinning Up in Deep RL! — Spinning Up documentation. It was created by OpenAI and covers reinforcement learning (RL). I remember, from when I was studying in the hope of working at OpenAI, how much public content they have around RL. I believe it’s going to be the quickest solution, as I think that for better results you might need to fine-tune GPT-4, which is not available yet. There are multiple routes to solving this; I would just build my own AI for it while some of the new stuff from OpenAI isn’t out yet.

Best of luck Alex!


Awesome effort, man. @MadieLaine has been working on a project she open-sourced for the Mi’kmaq language. She mentions a little bit about it here.
