Translating novels from English to Sinhala and Tamil

I need help with a prompt for translating novels from English to Sinhala and Tamil.

I gave the prompt as

prompt = f"Translate the following English text to Sinhala as a native speaker ': {text}"

Words like ‘blacksmith’ are translated incorrectly.

Any advice on improving the prompt? Thank you.

Please provide more details: whether this is the only prompt or there are preparation prompts, how the System role is used, the Temperature and Top_p settings, etc.

Please check if this thread could be helpful:
Force api response to be in non English language. How?

Hi, thanks a lot for the prompt response.

import openai  # legacy SDK (openai<1.0), which provides openai.ChatCompletion

# This function uses gpt-3.5-turbo to translate English text to Sinhala.
def translateSinhala(text):
    prompt = f"Translate the following English text to Sinhala as a native speaker ': {text}"
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "You are a helpful assistant that translates text."},
            {"role": "user", "content": prompt},
        ],
        temperature=0,
        max_tokens=700,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
    )

    # Return the text of the first (and only) completion choice.
    translation = response.choices[0].message.content.strip()
    return translation

Above is the prompt I am using. Sometimes translated lines are repeated.

Thanks
Nithya

I tried the following conversation with a model:

User: Please give options for translating the word "blacksmith" into
the Sinhala language. 

The responses were not satisfactory - so I tried a small fine-tuning.

User: I remake the 5 options translation list you gave me for the word
"blacksmith" in the Sinhala language:
1. කලුන්ගොඩවන්නා - Kalungodannaa - no translation, proper name (?);
2. කහ සුදුකාරයා - Kahasudukaraya - "The yellow and white person" - bad
translation;
3. කර්මාන්තකරු - Karamanthakaru - "Industrialist" - similar but not
good;
4. කර්මාන්තක - Karamanthaka - "Industry" - related but bad
translation;
5. කලුන්ගොඩන්ගේ - Kalungodangae - no translation, proper name (?);

Then you provided a sixth option:
6. කරුණාකරු - Karuṇākaru" - it seems a form of the "please" - bad
translation;

I provided two options for you:
1. කම්මල්කරු - Kammalkaru - "blacksmith";
2.  කම්මල්කරුවා - Kammalkaruvā - "blacksmith";

The only thing in common is the first part of these words:  "ක - Ka".
Please provide a concise and logical explanation. Do NOT apologize. 
Is it necessary a fine-tuning for the Sinhala language? Is it a
problem of "tokenization" or "embeddings"?

The response from the GPT-3.5 model at Temperature = 0.7 was:

Assistant: I cannot speak to the accuracy of the translations, as I am
not an expert in the Sinhala language. It is possible that there are multiple
translations for a single word, and the appropriate translation may
depend on context.
...
It is possible that the quality of the translations may be affected by
the tokenization and embeddings used by the NLP model.
Fine-tuning the model on Sinhala language data may improve the accuracy
of the translations. It is also possible that the quality of the
translations may be affected by the availability and quality of Sinhala
language training data.

I am very interested in this case, since an answer such as "I am not an expert in..." is uncommon. The diagnosis seems to be correct:

  1. the quality of the translations may be affected by the tokenization and embeddings used by the NLP model.
  2. the quality of the translations may be affected by the availability and quality of Sinhala language training data.
    It seems that the Sinhala language is beyond the limits of the OpenAI models - maybe GPT-4 performs better in Sinhala.

There is a three-level solution for this:

  1. Level 1 - Settings, Prompt, and Fine-tuning
  • Choose a model that could perform better in Sinhala.
  • Reduce Top_p. It ranges from 0 to 1 and, like Temperature, affects which words can be sampled. Start with Top_p=0.5, then lower (or raise) it as you tune the settings.
  • Try to increase max_tokens - tokenization is very efficient in English, but not in other languages, mainly in languages with limited model training.
    A suggestion for estimating max_tokens for a translation (one token is roughly 3/4 of an English word, so about 4/3 tokens per word):
Eng_tokens = 4/3 * (number of words in the English text);
Sin_tokens = 2 * Eng_tokens;
max_tokens = (Eng_tokens + Sin_tokens) + 15%;
  • Extensive use of the System role for context-maintenance - otherwise you have to replicate all the instructions for every single translation in the User prompt. Mind punctuation and delimiters - one instruction per line. Like this:
System:
Please follow the instructions below:
1. Perform as a professional-level translator of the English  language
into the Sinhala language;
2. Use the Sinhala language for the translation ONLY. Translate the
entire text;
3. The `max_token=1000`. Advise if the translation requires more tokens
to improve the quality of the entire translation;
4. Accept the recommendations provided in the `User` prompt for a better
translation and keep them in the context memory during the translation
process.
  • Prompt reinforcing the context instructions in the System role - for example:
User:
Follow the instructions provided in the `System` role, keeping in the
context of this conversation all the time.
Please translate the following text  in triple quotes delimiters into
the Sinhala language: """{text}"""
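  • And then correct the model within the conversation (this is in-context correction rather than actual fine-tuning or training, so it only lasts while the correction stays in the messages you send):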
User:
Your translation of the word "blacksmith" as "කර්මාන්තකරු"
(Karamanthakaru, "Industrialist") is incorrect.
Use the Sinhala word "කම්මල්කරු" (Kammalkaru) to translate the word
"blacksmith".
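
The Level 1 suggestions above can be sketched in Python roughly as follows. This is only a sketch under stated assumptions, not a tested translation pipeline: the names `estimate_max_tokens`, `build_request`, `SYSTEM_INSTRUCTIONS`, and the `corrections` dictionary are made up for illustration, the system text is adapted from the example above, and the token estimate uses the usual rule of thumb of about 100 tokens per 75 English words together with the 2x Sinhala factor suggested in this thread. Note that in the API `max_tokens` caps only the reply, so this budget is deliberately generous.

```python
import math

# Adapted from the System instructions suggested above (assumption: shortened).
SYSTEM_INSTRUCTIONS = """Please follow the instructions below:
1. Perform as a professional-level translator of the English language into the Sinhala language;
2. Use the Sinhala language for the translation ONLY. Translate the entire text;
3. Accept the recommendations provided in the `User` prompt for a better translation."""


def estimate_max_tokens(text, sinhala_factor=2.0, margin=0.15):
    """Rough token budget following the formula suggested in this thread."""
    words = len(text.split())
    eng_tokens = words * 4 / 3                # rule of thumb: ~100 tokens per 75 words
    sin_tokens = sinhala_factor * eng_tokens  # assumed expansion for Sinhala
    return math.ceil((eng_tokens + sin_tokens) * (1 + margin))


def build_request(text, corrections=None, model="gpt-3.5-turbo"):
    """Assemble keyword arguments for a chat completion call (no network access here)."""
    user_lines = []
    # The API keeps no memory between calls, so vocabulary corrections
    # have to be repeated in every request.
    for english, sinhala in (corrections or {}).items():
        user_lines.append(
            f'Use the Sinhala word "{sinhala}" to translate the word "{english}".'
        )
    user_lines.append(
        "Please translate the following text in triple quotes delimiters "
        f'into the Sinhala language: """{text}"""'
    )
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": SYSTEM_INSTRUCTIONS},
            {"role": "user", "content": "\n".join(user_lines)},
        ],
        "temperature": 0,
        "top_p": 0.5,  # starting value suggested above; adjust as needed
        "max_tokens": estimate_max_tokens(text),
    }


request = build_request(
    "The blacksmith lit the forge.",
    corrections={"blacksmith": "කම්මල්කරු"},
)
print(request["messages"][1]["content"])
```

With the legacy SDK used earlier in this thread, the result could then be sent with `openai.ChatCompletion.create(**request)`.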
  2. Level 2 - Dataset and Dictionary
  • Consider providing a dataset in a free-text but structured format containing translations of the words the model handles badly, to compensate for its limited training in Sinhala. Upload the dataset to a cloud storage of your choice. Use one translation per line (or per group of lines), and mind punctuation and structure.
DATASET: English-Sinhala Dictionary;
[Header Section]
Author: Nithya.k;
Contents: This dataset contains a dictionary of English-Sinhala words
and sentences of less frequent use. The Sinhala vocabulary may have more
than one option depending on the usage frequency (a number from 0 to 1).
The dataset has the following structure:
eng: "English word or sentence",
sin-01: "Sinhala word or sentence",
trans-01: "transliteration",
freq-01: usage frequency value,
sin-02: "optional Sinhala word or sentence",
trans-02: "transliteration",
freq-02: usage frequency value,
sin-03: "Other option of Sinhala word or sentence",
freq-03: usage frequency value;
...
[End of Header Section]

[Dictionary Section]
...
eng: "blacksmith",
sin-01: "කම්මල්කරු", trans-01: "Kammalkaru", freq-01: 0.7,
sin-02: "කම්මල්කරුවා", trans-02: "Kammalkaruvā", freq-02: 0.3;
...
  • And in the System role:
System:
Consider the dataset: {url address of the dataset}
Please follow the instructions below:
1. Perform as ...
  • Or in the User prompt:
User:
Consider the dataset: {url address of the dataset}
Follow the instructions provided in the `System` role, keeping in the
context of this translation process all the time...
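
As far as I know, the chat API does not open a URL by itself, so one way to try the dictionary idea is to paste the relevant entries directly into the System message. The sketch below assumes the dictionary section has already been loaded into a Python dict; `GLOSSARY`, `relevant_entries`, and `glossary_block` are hypothetical names used only for illustration, and the matching is a simple whole-word search that will miss plurals and inflected forms.

```python
import re

# Hypothetical in-memory form of the dictionary section above:
# English term -> list of (Sinhala word, transliteration, usage frequency).
GLOSSARY = {
    "blacksmith": [
        ("කම්මල්කරු", "Kammalkaru", 0.7),
        ("කම්මල්කරුවා", "Kammalkaruvā", 0.3),
    ],
}


def relevant_entries(text, glossary):
    """Return glossary entries whose English term occurs in the text as a whole word."""
    found = {}
    for term, options in glossary.items():
        if re.search(rf"\b{re.escape(term)}\b", text, flags=re.IGNORECASE):
            # Most frequent option first, so the preferred word is listed first.
            found[term] = sorted(options, key=lambda option: option[2], reverse=True)
    return found


def glossary_block(entries):
    """Format the selected entries as plain lines for the System message."""
    lines = ["Use the following English-Sinhala dictionary when translating:"]
    for term, options in entries.items():
        choices = ", ".join(f'"{word}" ({translit})' for word, translit, _ in options)
        lines.append(f'eng: "{term}" -> {choices}')
    return "\n".join(lines)


entries = relevant_entries("The Blacksmith lit the forge.", GLOSSARY)
print(glossary_block(entries))
```

Sending only the entries that occur in the current passage keeps the prompt short, which matters because Sinhala text consumes many tokens.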
  3. Level 3 - Embeddings
  • Embeddings, numerical vectors that represent words or sentences in context, demand a more programmatic approach and more research. Since the models have difficulty with Sinhala, we should be suspicious of how well Sinhala would perform under embeddings: it is hard to predict how similar the embedding of an English sentence will be to that of its Sinhala translation. We can go into more detail later, and more experienced users could advise on better procedures for using embeddings in translation, notably for a language on which the models were poorly trained.
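
To make "similarity" between embeddings concrete, here is a minimal sketch of cosine similarity, the measure commonly used to compare embedding vectors. The three vectors are made-up toy values, not real embeddings; whether a real embedding model places an English sentence close to its Sinhala translation is exactly the open question raised above.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("a zero vector has no direction")
    return dot / (norm_a * norm_b)


# Made-up 3-dimensional vectors standing in for real sentence embeddings.
english_vec = [0.2, 0.7, 0.1]
close_translation_vec = [0.25, 0.65, 0.05]
distant_translation_vec = [0.9, 0.05, 0.4]

print(cosine_similarity(english_vec, close_translation_vec))
print(cosine_similarity(english_vec, distant_translation_vec))
```

In a real experiment the vectors would come from an embedding model, and a low score for a known-good translation pair would suggest the model represents Sinhala poorly.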

Please let me know your choices.

Great, I think option 2 is best. Please help me write the prompt.

Sometimes, translated lines are repeating like

Please send me a solution. Thank you.

@nithya.k
These are levels of a solution, not options - the choice is which level is enough to get a desired translation.

First, try Level 1: adjust Top_p and max_tokens.
Then you may implement the extensive use of the System role in the prompt. I can help you with that - please send me a private message. If you can’t, tell me here and I will message you privately. When (or if) we get a solution, we can post it here to help others.