Translation ignores proper names

Hello everyone, I am using the GPT model gpt-3.5-turbo-1106 to translate messages passed in (messages may contain HTML tags) into English. The issue I’m facing is that currently, in the entire English conversation, there are some places where the signature is the proper name of another country. For example, the proper name of Vietnam is always detected as Vietnamese by my language detection component, but my expectation is that it should be detected as English.

Some approaches I’ve tried:

  1. Instructing GPT to ignore proper nouns or proper names (this doesn’t work; the instruction is: ‘Ignore detecting proper nouns and proper names’).
  2. Instructing GPT to synthesize the most frequently appearing language in the message to detect the language. However, this is not feasible because some cases involve two languages (for example, the message contains both English and Spanish, and if the Spanish language has fewer words, I need to detect the language as Spanish, not English).

Does anyone have any ideas for the above situation? This is the current prompt I am providing to GPT for translation:
[
  {
    "content": "You are the language model. Your task is to detect language of the given text and translate it to English (keep html format)\n\
        Specifically, you are required to:\n\
        1. Detected language: Identify the language (ISO 639-1) used in the given text. (Eg. Chinese, Spanish)\n\
        2. Detected language code: Identify the language code (ISO 639-1 codes) used in the given text (Eg. zh, es)\n\
        3. Translated text: Translate ALL given text to English and keep html format.\n\
        4. Reason detection: Give the reason for your Detected language result.\n\
        It's crucial to always provide the output in JSON format",
    "role": "system"
  },
  {
    "content": "Translate ALL following text to ${targetLanguage}: '${message}' and keep html format\
        Your output should be structured in JSON format and must include the following fields:\
        - Detected language (Eg. Chinese, Spanish)\
        - Detected language code (Eg. zh, es)\
        - Translated text (to English, keep html format)\
        - Reason detection\
        Remember to utilize all the provided data in generating your responses.",
    "role": "user"
  }
] 

You have two duplicate yet potentially conflicting sets of instructions.

I’m just going to rewrite the system and user instructions to accept any desired input and output language, specify the JSON more precisely, and put the JSON keys in an order that encourages the best reasoning: the justification is generated first, before the language name and code.

I still don’t quite understand what you mean by “signature” being a “country name”, but I added some instructions about what not to translate.

Modifying Python code I already had open, the new messages should be understandable:

import openai

client = openai.OpenAI()  # reads the OPENAI_API_KEY environment variable
system = """
You are Translate Pro, a backend processor that automatically translates written language to desired destination language, maintaining all other formatting.

// Required output format is only valid JSON, with these key:value pairs

original_language_justification: Explain AI analysis of input used to discover the predominant original language
original_language: Name of the input language (Chinese, Spanish, ...)
original_language_iso_code: ISO 639-1 two-letter abbreviation of detected input language (zh, es, ...)
output_language_translation: text and code with original language parts translated into the requested destination language

// Translation notes

- do not translate code elements such as HTML tags
- do not translate or rewrite proper names or names of countries

""".strip()

# Plain template; the placeholders are filled in by str.format() below
user_template = "Translate to {language}:\n\n---\n\n{input}"

def translate(in_text, out_lang="English"):
    """Send in_text to the model and return its raw JSON reply as a string."""
    user = user_template.format(language=out_lang, input=in_text)
    try:
        response = client.chat.completions.create(
            model="gpt-3.5-turbo",
            max_tokens=500,  # raise this for long messages
            top_p=0.1,       # low top_p keeps the output close to deterministic
            messages=[
                {"role": "system", "content": system},
                {"role": "user", "content": user},
            ],
        )
    except Exception as e:
        print(f"Error in OpenAI call: {e}")
        raise
    return response.choices[0].message.content

if __name__ == "__main__":
    text = "<div>Finally some <strong>code</strong> by Jay (_j) that will demonstrate usage</div>"
    output_json = translate(text, "Vietnamese")
    print(output_json)

The example input is at the bottom of the code above. The JSON produced:

{
  "original_language_justification": "The input text contains HTML tags and a proper name (Jay (_j)). The AI analysis detects that the original language is English.",
  "original_language": "English",
  "original_language_iso_code": "en",
  "output_language_translation": "<div>Cuối cùng có một <strong>code</strong> của Jay (_j) sẽ thể hiện cách sử dụng</div>"
}
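
Since translate() returns the reply as a plain string, you will probably want to parse and check it before using the fields. Here is a minimal sketch, assuming the four key names from the system prompt above. The names parse_translation and REQUIRED_KEYS are made up for illustration and are not part of any library:

```python
import json

# Keys the system prompt asks the model to produce (hypothetical constant name)
REQUIRED_KEYS = (
    "original_language_justification",
    "original_language",
    "original_language_iso_code",
    "output_language_translation",
)

def parse_translation(raw):
    """Parse the model's reply and check that all expected keys are present.

    Raises ValueError if the reply is not a JSON object or lacks a key.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as e:
        raise ValueError(f"Model reply is not valid JSON: {e}") from e
    if not isinstance(data, dict):
        raise ValueError("Model reply is not a JSON object")
    missing = [key for key in REQUIRED_KEYS if key not in data]
    if missing:
        raise ValueError(f"Model reply is missing keys: {missing}")
    return data

if __name__ == "__main__":
    # Stand-in for a real reply, so the sketch runs without an API call
    sample = json.dumps({
        "original_language_justification": "The text is written in English.",
        "original_language": "English",
        "original_language_iso_code": "en",
        "output_language_translation": "<div>Xin chào</div>",
    })
    result = parse_translation(sample)
    print(result["original_language_iso_code"])
    print(result["output_language_translation"])
```

If replies turn out to be malformed often, the -1106 models also accept response_format={"type": "json_object"} in client.chat.completions.create, which constrains the output to valid JSON. That mode requires the word JSON to appear somewhere in the messages, which this system prompt already satisfies.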

Hopefully this performs well on the names you wish to preserve; if not, the instructions are now easy to modify.
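
To make that kind of modification less error-prone, the translation notes could be kept in a list and appended to the prompt. This is only a rough sketch: build_system_prompt is a hypothetical helper, the base text is shortened, and the third note is an untested guess at a rule for the signature problem, not something I have verified against the model:

```python
def build_system_prompt(base, notes):
    """Append a '// Translation notes' section built from a list of rules."""
    lines = [base.strip(), "", "// Translation notes", ""]
    lines += [f"- {note}" for note in notes]
    return "\n".join(lines)

# Shortened stand-in for the full system prompt text
base = (
    "You are Translate Pro, a backend processor that automatically translates "
    "written language to desired destination language, maintaining all other formatting."
)

notes = [
    "do not translate code elements such as HTML tags",
    "do not translate or rewrite proper names or names of countries",
    # Hypothetical extra rule aimed at the signature issue; untested
    "ignore signatures and proper names when deciding the original language",
]

if __name__ == "__main__":
    print(build_system_prompt(base, notes))
```

Adding or removing a rule is then a one-line change to the list, and the same base text can be reused with different note sets while you experiment.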