Ideas for enhancing the quality of translation

We recently implemented a quick PoC using GPT-4 to translate some domain-specific content from English to Mandarin. The domain in this case is investment research. The translation was given a score of 8 out of 10 by a domain expert who is also a native Mandarin speaker. The same content, previously translated using Azure AI Translator, was given a score of 7 out of 10, so we seem to be on the right track with GPT-4.

However, the domain expert highlighted some words in the Mandarin output which just do not make sense or fit the context. One example was the word “write-off” in the context of financial investment. It was translated literally, as “write down”, which is wrong. This is probably because either there is no like-for-like word for “write-off” in Mandarin, or the whole sentence needs to be written differently to make use of the most apt word in Mandarin. I was told the most apt word in this example would translate back to either “exit” or “remove” in English.

So, here are my questions: How do I tell GPT-4 to pay special attention to words like this? If I had a dictionary of such words (which I do not at the moment), could I include it as examples in my system prompt? Are there other ideas someone has tried that I can borrow from?
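
To make the second question concrete, here is a minimal sketch of what a glossary embedded in the system prompt could look like, using the OpenAI Python client. The glossary entry, its guidance note, and the example sentence are hypothetical placeholders, not vetted terminology.

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical glossary of domain terms that tend to be translated too literally.
# The preferred renderings are described in English here; real entries would
# carry the vetted Mandarin terms supplied by the domain expert.
GLOSSARY = {
    "write-off": "use the Mandarin term for exiting/removing a position, "
                 "not a literal 'write down'",
}

glossary_lines = "\n".join(f'- "{term}": {note}' for term, note in GLOSSARY.items())

system_prompt = (
    "You are a financial translator specialising in investment research. "
    "Translate the user's English text into Mandarin. Pay special attention "
    "to the following terms and follow the guidance rather than translating "
    "literally:\n" + glossary_lines
)

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "The fund decided to write off the position."},
    ],
)
print(response.choices[0].message.content)
```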

Trying to mitigate each and every potential mistake one by one would be an absolute nightmare to manage and would eventually be detrimental to the model's output.

Have you considered trying a refinement process? You can apply some prompting techniques.

For example, you can let the model “reason” how it will perform the translation, it may be able to “prepare/frame” the translation and avoid these pitfalls. You can also try running it through a “rigorous translation expert” afterwards to score the translation.

So: 1 assistant to create the translation,
1 assistant to score it.

The scoring assistant is given a rubric. If the score is below, say, 8/10, you can send the translation back to the translation assistant, essentially creating a feedback loop.
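
As a rough illustration, here is a minimal sketch of that feedback loop with the OpenAI Python client. The rubric wording, the 8/10 threshold, the number of rounds, and the score parsing are all assumptions, not a prescribed setup.

```python
from openai import OpenAI

client = OpenAI()

# Illustrative rubric; the criteria and output format are assumptions.
RUBRIC = (
    "Score the Mandarin translation of the English source from 1-10, considering:\n"
    "- accuracy of domain terminology (investment research)\n"
    "- fluency and natural word choice in Mandarin\n"
    "- faithfulness to the meaning of the source\n"
    "Reply with a single line 'SCORE: <n>', then list any problematic terms."
)

def translate(source: str, previous: str = "", feedback: str = "") -> str:
    """Translation assistant; an earlier attempt plus reviewer notes can be folded in."""
    messages = [
        {"role": "system", "content": "Translate English investment research into Mandarin."},
        {"role": "user", "content": source},
    ]
    if previous:
        messages += [
            {"role": "assistant", "content": previous},
            {"role": "user", "content": f"Revise your translation. Reviewer notes:\n{feedback}"},
        ]
    resp = client.chat.completions.create(model="gpt-4", messages=messages)
    return resp.choices[0].message.content

def score(source: str, translation: str) -> str:
    """Scoring assistant: applies the rubric and returns the score plus notes."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": RUBRIC},
            {"role": "user", "content": f"Source:\n{source}\n\nTranslation:\n{translation}"},
        ],
    )
    return resp.choices[0].message.content

def translate_with_feedback(source: str, threshold: int = 8, max_rounds: int = 3) -> str:
    translation = translate(source)
    for _ in range(max_rounds):
        notes = score(source, translation)
        # Crude parse of "SCORE: <n>"; a real implementation would be stricter.
        digits = "".join(c for c in notes.split("SCORE:")[-1][:4] if c.isdigit())
        if digits and int(digits) >= threshold:
            break
        translation = translate(source, previous=translation, feedback=notes)
    return translation
```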

I never considered that I could build another assistant to score the translation produced by the first assistant. I will give this a try. I am wondering what the instructions for this scoring assistant would be. Specifically, if I am going to use GPT-4 for both the translation and the scoring assistant, how will the scoring assistant know that the translation assistant could have done a better job?

Through a rubric.

You can get multiple levels and variations of content through different prompting. An easy example is giving GPT a CEFR level first and then asking it to translate something.
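
For example, a quick sketch of level-conditioned prompting; the choice of levels, the source sentence, and the prompt wording are just illustrative:

```python
from openai import OpenAI

client = OpenAI()

# Illustrative only: made-up source sentence, two arbitrary CEFR levels.
SOURCE = "The fund was forced to write off the position after the issuer defaulted."

for level in ["B1", "C1"]:
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": (
                f"Target reading level: CEFR {level}.\n"
                "Translate the following sentence into Mandarin, keeping the "
                f"register appropriate for that level:\n\n{SOURCE}"
            ),
        }],
    )
    print(level, "->", resp.choices[0].message.content)
```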

CoT (chain-of-thought) prompting is a proven technique for improving the quality of responses.

Consider it this way: imagine you had to translate immediately, word by word, and once a word was chosen it was set in stone, forcing you to carry on even if you noticed a mistake.

By “priming” the translation first, you give the model the chance to reason about how the translation should be done.

Then, by having a different model with a different “evaluation mindset”, you can grade the result against the rubric and pass the translation back if it falls below the bar. The eval model should be able to flag the specific issue, and feeding that note back makes the translation model more “aware” of it on the retry.
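
Here is a minimal sketch of the “prime, then translate” half of that flow (the grading loop itself is the same as sketched earlier in the thread). The prompt wording and the example sentence are assumptions, not a prescribed recipe.

```python
from openai import OpenAI

client = OpenAI()

SOURCE = "The fund was forced to write off the position after the issuer defaulted."

# Step 1: "prime" the translation -- let the model reason about risky terms first.
plan = client.chat.completions.create(
    model="gpt-4",
    messages=[{
        "role": "user",
        "content": (
            "Before translating, list the finance terms in this sentence that risk "
            "being translated too literally into Mandarin, and note the rendering "
            f"you intend to use for each:\n\n{SOURCE}"
        ),
    }],
).choices[0].message.content

# Step 2: translate with that plan in context, so the word choices are already framed.
translation = client.chat.completions.create(
    model="gpt-4",
    messages=[{
        "role": "user",
        "content": f"Terminology plan:\n{plan}\n\nNow translate into Mandarin:\n{SOURCE}",
    }],
).choices[0].message.content

print(translation)
```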
