I am using the Python code below to translate a large number of English strings into different languages. I split thousands of English strings into chunks of 100, then join each chunk into a single string (so I can send it to the AI) using the ┊ character as a separator. I expect the response to contain the same number of strings, translated into the target language and also separated by the ┊ character.
    from typing import List

    import openai


    def batch_translate(originals: List[str], strings: List[str], language: str) -> List[str]:
        """Translate a batch of strings at once."""
        to_translate = []
        # Skip strings that are already translated; queue only the empty slots
        for i in range(len(strings)):
            if strings[i] == "" and originals[i] != "":
                to_translate.append(originals[i])  # Put the corresponding English string in the translate queue
        if not to_translate:
            return strings  # Return early if nothing to translate
        prompt = "\n┊\n".join(to_translate)
        response = openai.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {
                    "role": "system",
                    "content": f"You are a professional translator. Translate the following English texts to {language}. Each text is separated by '┊'. Respond with only the translations, separated by '┊', in the same order. You MUST return exactly the same number of translations in order, even if some translations are duplicates. Maintain strict 1-to-1 mapping."
                },
                {
                    "role": "user",
                    "content": prompt
                }
            ],
            temperature=0.3
        )
        translations = response.choices[0].message.content.strip().split("┊")
        translations = [t.strip() for t in translations]
        final_translations = []
        j = 0  # Index into translations, which only covers the strings that were sent
        for i in range(len(strings)):
            if strings[i] == "" and originals[i] != "":
                final_translations.append(translations[j])  # Use j, not i: translations is shorter than strings
                j += 1
            else:
                final_translations.append(strings[i])
        return final_translations
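For context, the chunking and joining step that feeds this function looks roughly like the sketch below. The helper names (`chunked`, `join_chunk`, `split_response`) are made up for illustration and are not part of my actual program; the chunk size of 100 and the separator are the ones described above.

```python
from typing import Iterator, List

SEPARATOR = "┊"


def chunked(items: List[str], size: int = 100) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` strings."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def join_chunk(chunk: List[str]) -> str:
    """Join one chunk into the single string that is sent to the model."""
    return f"\n{SEPARATOR}\n".join(chunk)


def split_response(text: str) -> List[str]:
    """Split a reply back into individual strings, trimming whitespace."""
    return [part.strip() for part in text.strip().split(SEPARATOR)]
```

If the model returned every string, `split_response` on the reply would give back a list the same length as the chunk that was sent.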
It works as expected most of the time, but roughly 1 time in 1000 the AI skips a string in the list and returns fewer translations than I sent.
For example, I send the following strings:
Date of Birth┊Date of Death, if deceased┊Town/city of birth┊Country of birth┊Marital status
The result would look like this (imagine it was in another language, say Korean):
Date of Birth┊Date of Death, if deceased┊Country of birth┊Marital status
So it skipped the 3rd string, Town/city of birth, and returned only 4 strings. Because of this, when I parse the response using the separator character and map the pieces back onto my list, I run into an index out of range error.
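To make the failure easier to spot, a count check could run right after splitting the reply, so a short response fails with a clear message instead of an index error further down. This is only a sketch; `parse_translations` is a hypothetical helper, not something in my current code.

```python
from typing import List

SEPARATOR = "┊"


def parse_translations(raw: str, expected: int) -> List[str]:
    """Split a model reply on the separator and verify the item count."""
    parts = [p.strip() for p in raw.strip().split(SEPARATOR)]
    if len(parts) != expected:
        # Fail loudly so the caller can retry or log the offending batch
        raise ValueError(f"expected {expected} translations, got {len(parts)}")
    return parts
```

With the example above, a reply containing 4 items for 5 inputs would raise `ValueError` immediately.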
The weirder thing is that this is not random. The same strings GPT decided to ignore get skipped repeatedly no matter how many times I run the program. It’s not that GPT is unable to recognize or translate the string “Town/city of birth” because when I give it another batch (another chunk of strings I prompt to translate) that contains the same string, it does translate it and returns the same number of strings.
Even worse, each language seems to have its own standards when it comes to skipping a string because the same string that gets skipped in one language is not skipped in another language. Each language seems to have its own problem strings and I don’t see a pattern at all.
Whenever I ran into this problem, the only way I could get GPT to translate a problematic string was to make it the only string in a batch. That way GPT recognizes the string and returns a translation. So my best guess is that the string somehow confuses GPT when it is combined with its neighbouring strings. But then why is the same batch translated into other languages without any problem?
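The manual workaround could be automated along these lines: try the whole batch first, and if the count does not match, resend each string as its own one-item batch. This is a sketch under assumptions; `translate_raw` is a hypothetical callable that wraps the API call and returns the raw reply text, and falling back for every string in the batch costs extra requests.

```python
from typing import Callable, List

SEPARATOR = "┊"


def translate_with_fallback(
    texts: List[str], translate_raw: Callable[[str], str]
) -> List[str]:
    """Translate a batch; on a count mismatch, retry one string at a time."""
    raw = translate_raw(f"\n{SEPARATOR}\n".join(texts))
    parts = [p.strip() for p in raw.strip().split(SEPARATOR)]
    if len(parts) == len(texts):
        return parts
    # Count mismatch: send every string as its own one-item batch
    return [translate_raw(text).strip() for text in texts]
```

Passing the API call in as a parameter also makes it possible to test the fallback with a fake translator that deliberately drops an item.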
But my strings are not so different from the example I provided above. They are not very long and do not contain special characters, and my intuition tells me that should not matter anyway as long as the strings don’t contain the separator character, ┊.
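To rule out the separator itself as the cause, the source strings can be scanned for it before batching. A minimal sketch, with `strings_containing_separator` as a hypothetical helper name:

```python
from typing import List

SEPARATOR = "┊"


def strings_containing_separator(strings: List[str]) -> List[int]:
    """Return the indices of strings that already contain the separator."""
    return [i for i, s in enumerate(strings) if SEPARATOR in s]
```

An empty result means no input string can be split apart by the separator on its own.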
Could this be a problem on my side, or is it just GPT acting unexpectedly?