Using GPT-4.1 as a Translator: Returns Too Few Translations or the Source Text As-Is

So I’m working on a translation evaluation project, and while testing my code for Dutch (nl) to Danish (da) translation, the model just sends back Dutch. I don’t speak either language myself, but I realized the BLEU scores were far too low for the translation to be correct (roughly between 1 and 2), and then noticed that the output looked closer to the source text than to the reference translation. So it didn’t translate at all; it returned a slightly altered version of the source text.

The strange thing is, the prompt is the same one I used for other language pairs, and for those it worked fine. So the Dutch-to-Danish translation task specifically seems to cause issues.

Has anyone encountered something similar?

LLMs aren’t universal translators. They perform well for the most common language pairs, but some combinations will definitely perform worse, as in your example.

You can search for Multilingual Massive Multitask Language Understanding (MMMLU) in the model’s system card to try to find more information.

You can also try gpt-4.5-preview, which is a larger model, to see if it performs a little better, but it’s more expensive too.

out of interest, were you using the full model, or mini/nano?

I was using gpt-4.1, likely this one: gpt-4.1-2025-04-14 (which I assume is what gets called when you specify model ‘gpt-4.1’).

Using the country abbreviation together with the language works well.

You may try the following; for code translation, adjust the instructions to specify what you want translated within code:

You understand all languages and are a flawless translator between them.

Translate from Dutch (nl) to Danish (da). Only return the Danish translation. Do not repeat the Dutch input or add any explanation.

If the input contains code:

  • Translate only the comments, variable names, and string literals that are in natural language.
  • Do NOT translate programming keywords, syntax, or function names from standard libraries.
  • Keep the code structure and logic intact.

Always return valid, executable code where applicable.

Does it make a difference what is part of the system prompt and what is part of the user prompt?

Please read more here:

https://platform.openai.com/docs/guides/text?api-mode=responses
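
One way to split it, as a minimal sketch (Python SDK, Chat Completions): keep the stable role description in the system message and the per-request task in the user message. The prompt text is taken from the example above; the exact split is just an illustration, not the only valid one.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Stable role/behaviour instructions -> system message
system_prompt = (
    "You understand all languages and are a flawless translator between them. "
    "If the input contains code, translate only natural-language comments and "
    "string literals; do not translate keywords, syntax, or standard-library "
    "function names, and always return valid, executable code."
)

# Per-request task -> user message
user_prompt = (
    "Translate from Dutch (nl) to Danish (da). Only return the Danish translation. "
    "Do not repeat the Dutch input or add any explanation.\n\n"
    "<Dutch text here>"
)

resp = client.chat.completions.create(
    model="gpt-4.1",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ],
)
print(resp.choices[0].message.content)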

@na50r

Consider using a Developer (System) Prompt. Here is mine:

Developer Prompt: “You are a linguistics expert specializing in translations. Do not provide additional commentary. Just perform the task at hand. Leave web URLs as is.”

Prompt: “Translate the following from Dutch to Danish:”
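
For reference, a minimal sketch of sending those two prompts with the Python SDK (Chat Completions; the Dutch text is just a placeholder):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

developer_prompt = (
    "You are a linguistics expert specializing in translations. "
    "Do not provide additional commentary. Just perform the task at hand. "
    "Leave web URLs as is."
)
dutch_text = "<Dutch text here>"  # placeholder for the text to translate

resp = client.chat.completions.create(
    model="gpt-4.1",
    messages=[
        {"role": "developer", "content": developer_prompt},
        {"role": "user", "content": "Translate the following from Dutch to Danish: " + dutch_text},
    ],
)
print(resp.choices[0].message.content)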

You can also do a Spell and Grammar check.

Prompt: “Convert the following statements to standard Dutch:”

Cool stuff…

Here is your post:

So I’m working on a translation evaluation project and while testing my code, that for the language translation Dutch (nl) to Danish (da), it just sends back Dutch. I myself don’t speak either languages but I realized that the BLEU scores were way too low for the translation to be correct (like between 1-2) and then noticed that the translation looked closer to the source text than the reference translation, so it didn’t translate at all but gave back a slightly altered version of the source text.

The strange thing is, the prompt is the same I used for other language pairs and for those it worked fine. So the task of translation Dutch-to-Danish specifically seems to cause issues.

Has anyone encountered something similar?

OpenAI translation from English to Dutch:

Dus ik werk aan een project voor het evalueren van vertalingen en tijdens het testen van mijn code, waarbij voor de taalvertaling van Nederlands (nl) naar Deens (da) werd gekozen, wordt er gewoon Nederlands teruggestuurd. Zelf spreek ik geen van beide talen, maar ik merkte dat de BLEU-scores veel te laag waren om een correcte vertaling te zijn (ongeveer tussen 1-2) en toen viel me op dat de vertaling meer leek op de brontekst dan op de referentievertaling, dus er werd helemaal niet vertaald maar gewoon een iets aangepaste versie van de brontekst teruggegeven.

Het vreemde is dat de prompt dezelfde is als die ik voor andere taalkoppels gebruikte, en daar werkte het prima. Dus specifiek de taak van vertalen van Nederlands naar Deens lijkt problemen te veroorzaken.

Heeft iemand iets soortgelijks meegemaakt?

OpenAI translation from Dutch to Danish:

Så jeg arbejder på et projekt til evaluering af oversættelser, og under testningen af min kode, hvor jeg valgte sprogoversættelsen fra nederlandsk (nl) til dansk (da), bliver der bare sendt nederlandsk tilbage. Jeg taler ikke nogen af sprogene selv, men jeg bemærkede, at BLEU-scorerne var alt for lave til at være en korrekt oversættelse (omkring 1-2), og så lagde jeg mærke til, at oversættelsen lignede kildeteksten mere end referenceoversættelsen, så der blev slet ikke oversat, men bare givet en let tilpasset version af kildeteksten tilbage.

Det mærkelige er, at prompten er den samme som den, jeg brugte til andre sprogpar, og der virkede det fint. Så specifikt opgaven med at oversætte fra nederlandsk til dansk ser ud til at give problemer.

Er der nogen, der har oplevet noget lignende?

OpenAI Spell and Grammar Check from Danish to Danish:

Så jeg arbejder på et projekt til evaluering af oversættelser, og under testningen af min kode, hvor jeg valgte sprogparret nederlandsk (nl) til dansk (da), bliver der blot returneret nederlandsk. Jeg taler ikke nogen af sprogene selv, men jeg bemærkede, at BLEU-scorerne var alt for lave til at være en korrekt oversættelse (omkring 1-2), og så lagde jeg mærke til, at oversættelsen lignede kildeteksten mere end referenceoversættelsen, så der blev slet ikke oversat, men bare givet en let tilpasset version af kildeteksten tilbage.

Det mærkelige er, at prompten er den samme som den, jeg brugte til andre sprogpar, og der virkede det fint. Så specifikt opgaven med at oversætte fra nederlandsk til dansk ser ud til at give problemer.

Er der nogen, der har oplevet noget lignende?

From Danish back to English:

So I am working on a project for evaluating translations, and during the testing of my code, where I chose the language pair Dutch (nl) to Danish (da), only Dutch is being returned. I don’t speak either of the languages myself, but I noticed that the BLEU scores were far too low for it to be a correct translation (around 1-2), and then I noticed that the translation resembled the source text more than the reference translation, so nothing was actually translated, but just a slightly adapted version of the source text was returned.

The strange thing is that the prompt is the same as the one I used for other language pairs, and there it worked fine. So specifically, the task of translating from Dutch to Danish seems to be causing problems.

Has anyone experienced anything similar?


The translations are correct, with only a minor grammar change.

The first project we did with OpenAI was about a year ago. For months, we tested 134 languages. For many languages, the results were very good. Some African and Indic languages didn’t translate at all.

It’s important to realize that a language represents the culture of a people, and some cultures have no concept of another culture’s concepts. For example, using OpenAI we created a short story about a young girl growing up with dragons. Most translations of that story went well (e.g. Chinese, Hindi) because those cultures know what a dragon is. A few other languages translated dragon into alligator because their culture didn’t know what a dragon was - at least that was my take.

In the year since, OpenAI (with GPT-4.1) has made a lot of progress with language translation.

I think you guys are missing the point… it’s not that LLMs can’t understand the prompt to translate. That is a very simple statement.

But you can’t tell it: Here is your system prompt, “You are the smartest AI in the universe, you have achieved AGI and can now speak all languages including Klingon and Vulcan fluently.”

There is a limit to what a model is trained for.

Well, Developer | System prompts are invaluable in translations. As shown above, my Developer prompt: “You are a linguistics expert specializing in translations. Do not provide additional commentary. Just perform the task at hand. Leave web URLs as is.” fixed two things:

(1) A couple of weeks ago I was translating from English to Hindi: It responded with the first sentence in English: “Here is your translation.” followed by the Hindi translation. Not acceptable. I modified the developer prompt and it was fixed.

(2) I encountered a corner case where URL links were getting translated. The modified developer prompt fixed that.

As you mentioned, I’m not so sure if the “Assistant” type verbiage does anything - maybe a ploy that eats tokens… :face_with_open_eyes_and_hand_over_mouth:

The bottom line is that my example above shows @na50r that English-to-Dutch-to-Danish-to-English translations using OpenAI are spot-on accurate. There is something that he is doing wrong that maybe a Developer prompt could solve.

To add to the above, this is the prompt I used:

System:
You are a $src_lang-to-$tgt_lang translator.

User:
Translate the following $src_lang sentences into $tgt_lang.
Please make sure to keep the same formatting, do not add more newlines.
Here is the text:

The text to translate is appended on the following line. It is generated by concatenating a list of sentences with newlines, i.e. '\n'.join(sents).

I encountered two issues with this prompt:

  • It failed to return the right number of sentences most of the time (sometimes it did work).
  • Occasionally, it just seemed to return the text in the source language. Dutch and Danish do look very similar, but you can use BLEU scores to confirm which one came back; the score will be in the single digits for one of them.

Important: The prompt DID work for several other language pairs.
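
For what it’s worth, a rough sketch of the kind of check I mean (it assumes newline-separated sentences, uses the sacrebleu package for the BLEU score, and the 50-point threshold is just an arbitrary cut-off):

import sacrebleu

def check_output(source_sents, output_text):
    """Rough checks for the two failure modes above."""
    out_sents = [s for s in output_text.split("\n") if s.strip()]

    # Issue 1: fewer (or more) lines back than sentences sent
    if len(out_sents) != len(source_sents):
        print(f"line count mismatch: sent {len(source_sents)}, got {len(out_sents)}")

    # Issue 2: output is basically still the source text.
    # BLEU of the output against the *source* is high when nothing was translated
    # (and single-digit when the output really is another language).
    bleu_vs_source = sacrebleu.corpus_bleu([output_text], [["\n".join(source_sents)]]).score
    if bleu_vs_source > 50:  # arbitrary cut-off
        print(f"output looks untranslated (BLEU vs. source = {bleu_vs_source:.1f})")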

I have demonstrated that OpenAI easily performs Dutch and Danish translations. The best way to solve your issue is for you to capture the JSON input you are sending to OpenAI and post it here for us to look at.

I found that for non standard languages, providing a dictionary is helpful.

I have to agree with this.
The more I experiment, the more I feel like my prompts aren’t making the model do what I want; they just increase the probability that it does it. The easiest way to show this is to have it do the same task with the same prompt in a for loop. For one prompt, it may get the correct output more often than for another. It is a bit annoying that I cannot enforce determinism aside from setting temperature to 0. It is also annoying that the same prompt works for most languages but then does weird stuff for very specific pairs…
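
To make that concrete, a minimal sketch of the kind of loop I mean (the prompts are simplified, the sentences are placeholders, and "correct" here just means the right number of lines came back):

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

text = "Sent1\nSent2\nSent3"  # placeholder: newline-separated Dutch sentences
n_sent = len(text.split("\n"))

N = 20
hits = 0
for _ in range(N):
    resp = client.chat.completions.create(
        model="gpt-4.1",
        temperature=0,
        messages=[
            {"role": "system", "content": "You are a Dutch-to-Danish translator."},
            {"role": "user", "content": (
                "Translate the following Dutch sentences into Danish.\n"
                "Please make sure to keep the same formatting, do not add more newlines.\n"
                "Here is the text:\n" + text)},
        ],
    )
    out = resp.choices[0].message.content
    # count non-empty lines in the reply and compare with what was sent
    if len([s for s in out.split("\n") if s.strip()]) == n_sent:
        hits += 1

print(f"{hits}/{N} runs returned the expected number of lines")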

What do you mean by “non standard languages”? Give an example. What do you mean by “providing a dictionary”? Give an example.

Please post here an example of a JSON request (using a non standard language and dictionary) for us to see. I want to make sure that you know what you are talking about.

Again, provide an example of a JSON request. Otherwise you are just going around in circles on the topic and will be considered, IMHO, a waste of time.

import os
from string import Template

from openai import OpenAI

SYS_TEMPL = Template("You are a $src_lang-to-$tgt_lang translator.")

USR_TEMPL = Template(
    "Translate the following $src_lang sentences into $tgt_lang.\n"
    "Please make sure to keep the same formatting, do not add more newlines.\n"
    "You are not allowed to omit anything.\n"
    "Here is the text:")

# Sentences separated by newlines
# Usually read from a .txt file using f.read()
text = '''
Sent1\n
Sent2\n
Sent3\n
'''

sys_prompt = SYS_TEMPL.substitute(src_lang='Dutch', tgt_lang='Danish')
user_prompt = USR_TEMPL.substitute(src_lang='Dutch', tgt_lang='Danish')
user_prompt = '\n'.join([user_prompt, text])  # join takes a single iterable

api_key = os.environ['OPENAI_API_KEY']  # or however the key is loaded
cli = OpenAI(api_key=api_key)
resp = cli.chat.completions.create(
    model='gpt-4.1',
    temperature=0,
    messages=[
        {'role': 'system', 'content': sys_prompt},
        {'role': 'user', 'content': user_prompt},
    ]
)

This is how I structure my calls.
Note, my main issue is that when I send it n sentences, it returns significantly fewer than n.
Here, n refers to the number of lines in the string, i.e. the number of strings separated by newline characters.

It will work occasionally but most often it does not. Specifically for nl-da.

In some cases, it also did not translate at all and just returned the same source text, but after more experiments that turned out to happen rarely (I was unable to reproduce it). This, on the other hand, is reproducible.

This prompt is more or less stable for most other language pairs I used, which is why I was confused. Why should a prompt that works for many other languages be the bottleneck here? The task is not complex.

I am talking about languages that most people don’t think are important, like Kriol and Garifuna. And I don’t need any complex code to get it to work. I store the dictionary in a small language model and give it a system prompt to use the dictionary to translate the English response. Way simpler than what is being proposed. Since I speak Kriol, I can validate the responses, and they are consistently correct.
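
As a rough illustration only (the entries and the prompt wording below are made up, not my actual setup), the general idea is to make the dictionary available to the model alongside the system prompt:

# Made-up placeholder entries; a real dictionary would have far more pairs.
glossary = {
    "thank you": "<Kriol phrase>",
    "good morning": "<Kriol phrase>",
}
glossary_block = "\n".join(f"{en} -> {kr}" for en, kr in glossary.items())

system_prompt = (
    "You translate English into Kriol. Whenever an entry from the dictionary below "
    "applies, use it; otherwise translate as naturally as possible.\n\n"
    "Dictionary:\n" + glossary_block
)
# system_prompt is then sent as the system/developer message, as in the other examples in this thread.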

Reading from a text file can be problematic, but here are a couple of things that may help:

(1) Convert the text file into a string, append the string to the user prompt, and handle all special JSON escape characters.

(2) Set temperature to 0.4
(3) Set top_p to 0.8
(4) You are mixing your system prompt with your user prompt. Your user prompt should say:
“Translate the following from Dutch to Danish:”
(5) For now, forget the system prompt.
(6) Make sure JSON output is encoded to UTF-8
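
To illustrate (1) and (6): if you build the request body with json.dumps, the escaping and the UTF-8 encoding are handled for you. A minimal sketch for capturing the exact JSON you are sending (the file name is a placeholder):

import json

with open("sentences.txt", encoding="utf-8") as f:
    text = f.read()  # (1) the file contents as one string

payload = {
    "model": "gpt-4.1",
    "temperature": 0.4,   # (2)
    "top_p": 0.8,         # (3)
    "messages": [
        # (4)/(5): user prompt only, no system prompt for now
        {"role": "user", "content": "Translate the following from Dutch to Danish: " + text},
    ],
}

# json.dumps escapes quotes, backslashes and control characters for you (1),
# and encoding the result gives UTF-8 bytes (6).
body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
print(body.decode("utf-8"))  # this is the JSON you can capture and post here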

Try the above. If there are still problems, capture your JSON request and post it here like the one below.

Here is an actual JSON request that I made in a previous post:

{
  "model": "gpt-4.1",
  "messages": [
    {"role": "developer", "content": "You are a linguistics expert specializing in translations. Do not provide additional commentary. Just perform the task at hand. Leave web URLs as is."},
    {"role": "user", "content": "Translate the following from Dutch to Danish: Dus ik werk aan een project voor het evalueren van vertalingen en tijdens het testen van mijn code, waarbij voor de taalvertaling van Nederlands (nl) naar Deens (da) werd gekozen, wordt er gewoon Nederlands teruggestuurd. Zelf spreek ik geen van beide talen, maar ik merkte dat de BLEU-scores veel te laag waren om een correcte vertaling te zijn (ongeveer tussen 1-2) en toen viel me op dat de vertaling meer leek op de brontekst dan op de referentievertaling, dus er werd helemaal niet vertaald maar gewoon een iets aangepaste versie van de brontekst teruggegeven.\r\rHet vreemde is dat de prompt dezelfde is als die ik voor andere taalkoppels gebruikte, en daar werkte het prima. Dus specifiek de taak van vertalen van Nederlands naar Deens lijkt problemen te veroorzaken.\r\rHeeft iemand iets soortgelijks meegemaakt?\r"}
  ],
  "temperature": 0.4,
  "top_p": 0.8
}

EDIT, IMPORTANT, READ: I think I see your problem. You are appending \n to every sentence. \n is a LINE FEED - some entities (e.g. Microsoft) call it a New Line. So don’t do that. Follow the JSON escape character advice from MS above.

I can see the benefits of having your own Kriol small language model. Since Kriol is an English-based language, does your dictionary just map English words to Kriol words?

Do you accomplish this with OpenAI?