I am trying to get GPT3 to translate documents and video subtitles from English into another language. The target language is already supported by the base model, and the Playground results show pretty good, Google Translate-level translations for generic or easy sentences.
However, some of the topics covered by the documents I am trying to translate are very industry-specific, and the technical vocabulary often has to be written in both English and the target language, because the program the documents are written for hasn’t been translated into the target language yet and currently only supports English.
Therefore, there are often awkward situations where I have to write the translated term followed by the original English term in parentheses as part of the completions in my dataset, like so: Target(English).
This is particularly noticeable in video subtitle translation, since the videos I’m translating go over many of the program’s menu items, and the speaker stumbling over their words doesn’t help.
The documents that need to be translated also differ in tone, and GPT3 does too good a job of tracking only the tone of the sample completions: the translated documents follow the tone of the sample completions almost exactly, while missing key vocabulary and details.
Here’s a list of difficulties I’m facing:
- Differences in tone between documents don’t carry over well (the model sticks too closely to the tone of the fine-tuning samples)
- Having to write technical vocabulary in both English and the target language confuses the model
- The target language (Korean) uses a non-Latin-1 character set and has a word order that is largely the reverse of English, which makes it difficult to structure complex and compound sentences
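To make the dual-language requirement concrete, here is a minimal sketch of a check that flags translated sentences where a technical term is missing its Target(English) form. The glossary entries, the sample sentences, and the function names are all hypothetical and only illustrate the idea; a real glossary would come from the program's actual menu items.

```python
import re

# Hypothetical glossary: English UI term -> Korean rendering (illustrative only).
GLOSSARY = {
    "File": "파일",
    "Layer": "레이어",
}

def dual_form(english, glossary):
    """Return the term in the Target(English) form, e.g. 파일(File)."""
    return f"{glossary[english]}({english})"

def missing_terms(source, translation, glossary):
    """List glossary terms that occur in the English source but whose
    Target(English) form is absent from the translation."""
    missing = []
    for english in glossary:
        # Whole-word, case-sensitive match, so "File" is not found
        # inside "Profile" or "Files".
        if re.search(rf"\b{re.escape(english)}\b", source):
            if dual_form(english, glossary) not in translation:
                missing.append(english)
    return missing

source = "Open the File menu and add a Layer."
translation = "파일(File) 메뉴를 열고 레이어를 추가합니다."
print(missing_terms(source, translation, GLOSSARY))
```

A check like this does not fix the translation, but it could narrow down which sentences need their vocabulary corrected by hand.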
So far I’ve almost given up on training GPT3 to translate video subtitles (it’s a crapshoot for the most part). Document translation works relatively well, but I still have to fix much of the vocabulary by hand, which is not an easy task in itself because the documents are PDFs.
Here is a short list of things I’ve already tried and the problems with each:
- Give GPT3 a list of [English prompt] + [Translated completion] sentence pairs
  - Translates sentence by sentence well (I’m feeding in strings split on periods), but occasionally a sentence is replaced with an unrelated sentence that is in the dataset.
- Give GPT3 a list of [English prompt] + [Translated completion] paragraph pairs
  - Does not translate sentence by sentence well at all. GPT3 seems to assume that the target language should only be produced in paragraphs, as in the fine-tuning dataset, so it omits any sentence that is not part of a paragraph in that dataset.
  - Results in sporadic, cut-off sentences that don’t make any sense in the target language. This method of fine-tuning gives less variation across translation attempts, but variation still shows up from time to time.
- Give GPT3 a list of [English prompt] + [Translated completion] sentence pairs together with a list of vocabulary terms and their translations
  - This was an attempt to pin certain core technical terms to fixed target outputs. It did have an effect but never got close to perfect, and it had a very interesting side effect where an entire sentence would be (incorrectly) translated into a single word.
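For reference, a minimal sketch of how a sentence-pair dataset like the one in the first attempt might be assembled. It assumes the prompt/completion JSONL layout used by the legacy fine-tuning format; the separator, the stop marker, the sample pair, and the function names are assumptions for illustration, not the exact setup used here.

```python
import json
import re

# Assumed conventions: a fixed separator closes every prompt and a
# fixed stop marker closes every completion, so the model can tell
# where the translation starts and where it should end.
SEPARATOR = "\n\n###\n\n"
STOP = " END"

def split_sentences(text):
    """Split on whitespace that follows ., ! or ?, keeping the punctuation."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def to_record(english, korean):
    """Build one prompt/completion training record."""
    return {
        "prompt": english.strip() + SEPARATOR,
        "completion": " " + korean.strip() + STOP,
    }

def write_jsonl(pairs, path):
    """Write (english, korean) pairs as one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for english, korean in pairs:
            # ensure_ascii=False keeps Hangul readable instead of \uXXXX escapes.
            f.write(json.dumps(to_record(english, korean), ensure_ascii=False) + "\n")

# Hypothetical sample pair using the Target(English) convention.
pairs = [("Open the File menu.", "파일(File) 메뉴를 엽니다.")]
print(json.dumps(to_record(*pairs[0]), ensure_ascii=False))
```

With a layout like this, the same separator would have to be appended to every prompt at inference time, and the stop marker passed as the stop sequence, otherwise the model may run on past the end of the sentence.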
So what should I do to improve this? One thing I’ve contemplated is refining the fine-tuning dataset and the test set so that they contain only complete, clear sentences. For instance:
“Open the folder on the top right, excuse me, top left side. I’m sorry. I’m just failing to speak today.”
Would be converted to
“Please open the folder on the top left corner of your screen.”
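The rewrite itself would still be done by hand, but finding the lines that need it could be partly automated. Below is a rough sketch that flags transcript lines containing self-corrections or filler; the marker list is a hypothetical starting point and would need tuning against the actual transcripts.

```python
import re

# Hypothetical marker phrases that suggest the speaker corrected
# themselves or filled a pause. Extend this list for real transcripts.
DISFLUENCY_MARKERS = ["excuse me", "i'm sorry", "i mean", "um", "uh"]

_PATTERN = re.compile(
    r"\b(" + "|".join(re.escape(m) for m in DISFLUENCY_MARKERS) + r")\b",
    re.IGNORECASE,
)

def needs_cleanup(line):
    """Return True if the line contains any of the marker phrases."""
    # Normalize curly apostrophes so "I’m sorry" matches "i'm sorry".
    normalized = line.replace("’", "'")
    return bool(_PATTERN.search(normalized))

raw = "Open the folder on the top right, excuse me, top left side. I’m sorry. I’m just failing to speak today."
clean = "Please open the folder on the top left corner of your screen."
print(needs_cleanup(raw), needs_cleanup(clean))
```

Lines that are flagged would be rewritten manually before going into either the fine-tuning set or the test set.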
I’m open to any suggestions, but I’d also like to know whether what I’m aiming for is actually achievable through fine-tuning. Do you think this kind of technical translation is outside the scope of what GPT3 does well, or not feasible through fine-tuning alone? Let me know what you think.