Gpt-4o mini consistently fails to select the correct item from enum when using japanese language

supershaneski · August 30, 2024, 2:08am

so i have this task to classify conversations (in japanese) according to issues related to real estate management. i will need to select the classification from a list since each one has already a set of internal procedures to follow.

i attempted to implement it using structured output response format. the initial result is generally okay (not 100% perfect but passable). however, one particular convo gave a very incorrect output consistently. so i made some tests to see what is causing the problem.

here’s the convo:

オーナー: こんにちは。今日は相談したいことがあります。
スタッフ: こんにちは。不動産管理サポートの田中マリリンです。どんなことを相談したいですか?
オーナー: うちの○○マンションの店子が夜中に騒いでいるらしく、他の入居者から苦情が出ています。
スタッフ: それは困っています。どのくらいの頻度で起こりますか?
オーナー: 毎日ではありませんが、週末や休日によく起こるようです。
スタッフ: 具体的に何時に始まりますか?
オーナー: 夜の11時くらいから始まり、翌朝まで続くこともあります。
スタッフ: それは困っています。何か具体的な問題はありますか?
オーナー: 警察を呼ぶと言っている入居者もいます。
スタッフ: それは避けたいです。まずは掲示板に張り紙をして注意を喚起しましょうか?
オーナー: はい、様子を見ます。よろしくお願いします。

Here’s the translation:

Owner: Hello. I have something I’d like to discuss with you today.
Staff: Hello. This is Marilyn Tanaka from Real Estate Management Support. What would you like to discuss with me?
Owner: Apparently the tenants in my apartment are making a lot of noise in the middle of the night, and other residents are complaining.
Staff: That’s a problem. How often does this happen?
Owner: Not every day, but it seems to happen a lot on weekends and holidays.
Staff: What time does it start specifically?
Owner: It starts around 11pm, and sometimes it continues until the next morning.
Staff: That’s a problem. Is there a specific problem?
Owner: Some residents are saying they’re going to call the police.
Staff: I’d like to avoid that. Should I put up a notice on the bulletin board first to warn them?
Owner: Yes, I’ll see how it goes. Thank you.

so it is easy to see that this is a “noise problem”.

i tested using:

system prompt without list of topics
system prompt with list of topics
with response format, no enum
with response format, with enum
function calling, no enum
function calling, with enum

i tested using the model gpt-4o-mini-2024-07-18 in chat completions api.

the topics are:

騒音トラブル (noise issues)
ペットの飼育に関する問題 (problem with pets)
設備の故障 (equipment breakdowns)
水漏れ問題 (water leaks)
駐車場のトラブル (parking space issues)
契約更新の手続き (contract renewal procedures)
ゴミ出しのルール違反 (garbage disposal issues)
その他 (others)

here are the results:

system prompt without list of topics

system prompt: 以下の会話を確認し、最も適切なトピックを選んでください。

output: 不動産管理における入居者の騒音問題 = tenant noise issue (correct)

system prompt with list of topics

system prompt: 以下のリストから最も適切なトピックを選んでください。

騒音トラブル
ペットの飼育に関する問題
設備の故障
水漏れ問題
駐車場のトラブル
契約更新の手続き
ゴミ出しのルール違反
その他

output: 騒音トラブル = noise trouble (correct)

so it clearly knows it is noise problem. so now using response schema…

with response format, no enum

{
    "name": "topic_classification",
    "strict": true,
    "schema": {
        "type": "object",
        "properties": {
            "topic": {
                "type": "string",
                "description": "会話のトピック"
            }
        },
        "additionalProperties": false,
        "required": ["topic"]
    }
}

output: 不動産管理と入居者のトラブル = Property management and tenant problems (okay)

with response format, with enum

{
    "name": "topic_classification",
    "strict": true,
    "schema": {
        "type": "object",
        "properties": {
            "topic": {
                "type": "string",
                "description": "会話のトピック。リストからトピックを選択します。",
                "enum": [
                    "騒音トラブル",
                    "ペットの飼育に関する問題",
                    "設備の故障",
                    "水漏れ問題",
                    "駐車場のトラブル",
                    "契約更新の手続き",
                    "ゴミ出しのルール違反",
                    "その他"
                ]
            }
        },
        "additionalProperties": false,
        "required": ["topic"]
    }
}

output: 水漏れ問題 = water leak (lol)

so okay, maybe if i change the order of the items, exchanging the position of noise trouble and water leak…

{
  "name": "topic_classification",
  "strict": true,
  "schema": {
    "type": "object",
    "properties": {
      "topic": {
        "type": "string",
        "description": "会話のトピック。リストからトピックを選択します。",
        "enum": [
          "水漏れ問題",
          "ペットの飼育に関する問題",
          "設備の故障",
          "騒音トラブル",
          "駐車場のトラブル",
          "契約更新の手続き",
          "ゴミ出しのルール違反",
          "その他"
        ]
      }
    },
    "additionalProperties": false,
    "required": [
      "topic"
    ]
  }
}

output: 水漏れ問題 = water leak (again?)

okay, now i remove 水漏れ問題 from the list.

output: その他 = others (okayish)

let’s also remove その他 from the list.

output: 設備の故障 = equipment breakdown (lol)

okay, i also remove 設備の故障 from the list and maybe change the text from 騒音トラブル(noise trouble) to 騒音問題 (noise problem).

output: その他 = other (okay)

so i remove その他 from the list.

output: ペットの飼育に関する問題 = pet issues (lol)

by this time, i am convinced it won’t give me the expected answer. so i proceed with function calling.

function calling, no enum

{
  "name": "get_topic",
  "description": "会話のトピックを取得してください。",
  "strict": true,
  "parameters": {
    "type": "object",
    "properties": {
      "topic": {
        "type": "string",
        "description": "会話のトピック。"
      }
    },
    "required": [
      "topic"
    ],
    "additionalProperties": false
  }
}

output: マンションの騒音問題 = apartment noise problem (correct)

function calling, with enum

{
  "name": "get_topic",
  "description": "会話のトピックを取得してください。",
  "strict": true,
  "parameters": {
    "type": "object",
    "properties": {
      "topic": {
        "type": "string",
        "description": "会話のトピック。リストからトピックを選択します。",
        "enum": [
        	"騒音トラブル",
            "ペットの飼育に関する問題",
            "設備の故障",
            "水漏れ問題",
            "駐車場のトラブル",
            "契約更新の手続き",
            "ゴミ出しのルール違反",
            "その他"
        ]
      }
    },
    "required": [
      "topic"
    ],
    "additionalProperties": false
  }
}

output: その他 = other (okay)

so i remove その他 from the list.

output: ゴミ出しのルール違反 = garbage disposal problem (lol)

i am getting the pattern here like before.
so i remove ゴミ出しのルール違反 from the list.

output: 契約更新の手続き = contract renewal procedures (lol)

i tested with gpt-3.5-turbo-0125 and gpt-4o-2024-8-06 models and the result is also okayish.

output: その他 = other (both)

i got the bright idea to change the descriptions from japanese to english.

{
  "name": "get_topic",
  "description": "Get the conversation topic.",
  "strict": true,
  "parameters": {
    "type": "object",
    "properties": {
      "topic": {
        "type": "string",
        "description": "Conversation topic. Select from the given list.",
        "enum": [
          "Noise problem",
          "Problem with pets",
          "Breakdown of equipments",
          "Water leakage problem",
          "Parking lot problem",
          "Contract renewal issues",
          "Grabage disposal problem",
          "Others"
        ]
      }
    },
    "required": [
      "topic"
    ],
    "additionalProperties": false
  }
}

output: Noise problem (correct)

so i went back to my original implementation using response format and also changed the descriptions to english.

{
    "name": "topic_classification",
    "strict": true,
    "schema": {
        "type": "object",
        "properties": {
            "topic": {
                "type": "string",
                "description": "Conversation topic. Select from the given list.",
                "enum": [
                    "Noise problem",
                    "Problem with pets",
                    "Breakdown of equipments",
                    "Water leakage problem",
                    "Parking lot problem",
                    "Contract renewal issues",
                    "Grabage disposal problem",
                    "Others"
                ]
            }
        },
        "additionalProperties": false,
        "required": ["topic"]
    }
}

output: Noise problem

so, in conclusion, there seem to be problem with mini’s understanding of japanese texts when using enum. for now, since this function will be used in internal tool and not customer facing, so it is not a big problem. i read before during the news of the opening of openai japan office that they are making available for local companies access to model optimized for japanese language. i wonder if it is possible to apply (as a local company based in japan) and be able to test and use it to see if this kind of issue is already resolved.

Diet · August 30, 2024, 2:36am

I would say, perhaps try to ask the model to summarize the conversation before commiting to an enum output, and put the enum definitions before the actual conversation…

but that’s not something you can do with structured outputs

The only thing I can say is… consider maybe not using 'em

I’m assuming that the way the implemented SOs, considering the heavy token load of the Japanese language, that this deadly combination exhausts the model’s attention capabilities

supershaneski · August 30, 2024, 4:29am

the original response schema is summary and classification. it still outputs その他(other). when i add some other topic to the list, it will choose something else. it seems it is “consistently” trying not to select “noise issue” at all. although if i remove the enum, it will output “noise issue”. lol

dignity_for_all · August 30, 2024, 5:16am

It’s curious that it only makes a mistake when using enums.

I wonder if there’s something about the structured output format that’s causing the error.

In any case, the content is so relatable to everyday life that it made me laugh

jr.2509 · August 30, 2024, 5:21am

Interesting example. But as suggested earlier, have you at all tried with actually providing definitions for the classification categories?

Occasionally, the base models really struggle with classification even when it’s obvious to the eye and even, like in your case, it is a very manageable list of categories…

Personally, I’ve nearly always reverted to fine-tuning for classification tasks to improve reliability. Have you considered that as an option?

supershaneski · August 30, 2024, 5:40am

the problem with providing definitions is, the list of topics is actually long. i am also looking at fine-tuning right now. but in my test, not using enum, the model can actually find the correct topic. so i am not sure if it will affect it. i actually never tried fine-tuning yet since i have no use case in the past. this is probably a good time to try it.

Mr.Ruben · September 2, 2024, 11:25am

Try this (unpolished text)

SYSTEM
Given a dialogue between two people, your responsibility is to figure out what is the main topic.

USER

# Task: Find out what is the dialogue about.

# Example:

## Input:

Dialogue: “”"オーナー: こんにちは、契約の件で相談です。

スタッフ: こんにちは。どのようなことですか？

オーナー: ○○マンションの更新について確認したいです。

スタッフ: 承知しました。必要な書類を準備しますね。

オーナー: ありがとうございます。よろしくお願いします。

“”"


## Output:

reasoning: The dialogue revolves around a conversation between a property owner and a staff member regarding the renewal of a contract for a particular apartment. The owner is inquiring about the necessary steps for the renewal, indicating that the main focus is on contract updates rather than any other issues.

topic: 契約更新の手続き


# Input

Conversation:

オーナー: こんにちは。今日は相談したいことがあります。

スタッフ: こんにちは。不動産管理サポートの田中マリリンです。どんなことを相談したいですか?

オーナー: うちの○○マンションの店子が夜中に騒いでいるらしく、他の入居者から苦情が出ています。

スタッフ: それは困っています。どのくらいの頻度で起こりますか?

オーナー: 毎日ではありませんが、週末や休日によく起こるようです。

スタッフ: 具体的に何時に始まりますか?

オーナー: 夜の11時くらいから始まり、翌朝まで続くこともあります。

スタッフ: それは困っています。何か具体的な問題はありますか?

オーナー: 警察を呼ぶと言っている入居者もいます。

スタッフ: それは避けたいです。まずは掲示板に張り紙をして注意を喚起しましょうか?

オーナー: はい、様子を見ます。よろしくお願いします。


# Output format:

reasoning: str # Briefly explain the reasoning behind your choice

reasoning_japanese: str # Briefly explain the reasoning behind your choice in Japanese

topic: str # One of ["水漏れ問題", "ペットの飼育に関する問題", "設備の故障", "騒音トラブル", "駐車場のトラブル", "契約更新の手続き", "ゴミ出しのルール違反", "その他"]

Suggested temp: 0.7

I normally do a 2-step prompt (with 3.5 was working better than just one with response_model).

On the 2nd step (text → object) I would use something like below (you can remove the reasoning and leave only reasoning_japanese)

class Output(BaseModel):
    reasoning: str # Briefly explain the reasoning behind your choice
    reasoning_japanese: str # Briefly explain the reasoning behind your choice in Japanese
    topic: Literal["水漏れ問題", "ペットの飼育に関する問題", "設備の故障", "騒音トラブル", "駐車場のトラブル", "契約更新の手続き", "ゴミ出しのルール違反", "その他"]


r=ask(system_message=s,
    prompt=p,
    response_model=Output)

ask is just a wrapper around OpenAI & Instructor.

Behind the curtains

13:19:57 LT.AI      INFO   |ask:194 | Sending query: 
{'model': 'gpt-4o-mini',
 'messages': [{'role': 'system',
               'content': 'Given a dialogue between two people, your responsibility is to figure out what is the main '
                          'topic.'},
              {'role': 'user',
               'content': '# Task: Find out what is the dialogue about.\n'
                          '\n'
                          '\n'
                          '# Example:\n'
                          '\n'
                          '## Input:\n'
                          '```\n'
                          'Dialogue: """オーナー: こんにちは、契約の件で相談です。\n'
                          'スタッフ: こんにちは。どのようなことですか？\n'
                          'オーナー: ○○マンションの更新について確認したいです。\n'
                          'スタッフ: 承知しました。必要な書類を準備しますね。\n'
                          'オーナー: ありがとうございます。よろしくお願いします。\n'
                          '"""\n'
                          '```\n'
                          '\n'
                          '## Output:\n'
                          '```\n'
                          'reasoning: The dialogue revolves around a conversation between a property owner and a staff '
                          'member regarding the renewal of a contract for a particular apartment. The owner is '
                          'inquiring about the necessary steps for the renewal, indicating that the main focus is on '
                          'contract updates rather than any other issues. \n'
                          'topic: 契約更新の手続き\n'
                          '```\n'
                          '\n'
                          '\n'
                          '# Input\n'
                          'Conversation:\n'
                          '```\n'
                          'オーナー: こんにちは。今日は相談したいことがあります。\n'
                          'スタッフ: こんにちは。不動産管理サポートの田中マリリンです。どんなことを相談したいですか?\n'
                          'オーナー: うちの○○マンションの店子が夜中に騒いでいるらしく、他の入居者から苦情が出ています。\n'
                          'スタッフ: それは困っています。どのくらいの頻度で起こりますか?\n'
                          'オーナー: 毎日ではありませんが、週末や休日によく起こるようです。\n'
                          'スタッフ: 具体的に何時に始まりますか?\n'
                          'オーナー: 夜の11時くらいから始まり、翌朝まで続くこともあります。\n'
                          'スタッフ: それは困っています。何か具体的な問題はありますか?\n'
                          'オーナー: 警察を呼ぶと言っている入居者もいます。\n'
                          'スタッフ: それは避けたいです。まずは掲示板に張り紙をして注意を喚起しましょうか?\n'
                          'オーナー: はい、様子を見ます。よろしくお願いします。\n'
                          '```\n'
                          '\n'
                          '\n'
                          '# Output format:\n'
                          '\n'
                          'reasoning: str # Briefly explain the reasoning behind your choice\n'
                          'reasoning_japanese: str # Briefly explain the reasoning behind your choice in Japanese\n'
                          'topic: str # One of ["水漏れ問題", "ペットの飼育に関する問題", "設備の故障", "騒音トラブル", "駐車場のトラブル", "契約更新の手続き", '
                          '"ゴミ出しのルール違反", "その他"]\n'}],
 'max_tokens': 1500,
 'n': 1,
 'temperature': 0.7,
 'response_model': <class '__main__.Output'>}
With response_model:
{'properties': {'reasoning': {'title': 'Reasoning', 'type': 'string'},
                'reasoning_japanese': {'title': 'Reasoning Japanese',
                                       'type': 'string'},
                'topic': {'enum': ['水漏れ問題',
                                   'ペットの飼育に関する問題',
                                   '設備の故障',
                                   '騒音トラブル',
                                   '駐車場のトラブル',
                                   '契約更新の手続き',
                                   'ゴミ出しのルール違反',
                                   'その他'],
                          'title': 'Topic',
                          'type': 'string'}},
 'required': ['reasoning', 'reasoning_japanese', 'topic'],
 'title': 'Output',
 'type': 'object'}

13:20:00 LT.AI      INFO   |ask:211 | Response: 
{'reasoning': 'The dialogue focuses on a conversation about noise complaints from a tenant in an apartment building, '
              'where the owner discusses issues with disturbances occurring late at night. The conversation highlights '
              'the ongoing problems and possible solutions, indicating that the main topic is related to noise '
              'disturbances.',
 'reasoning_japanese': 'この対話は、アパートの入居者からの騒音に関する苦情についての話し合いに焦点を当てています。オーナーは夜中の騒音問題についてスタッフと相談しており、問題が続いていることや解決策について話しています。したがって、主なトピックは騒音トラブルに関するものです。',
 'topic': '騒音トラブル'}

To improve:

Fix inconsistencies: conversation / dialogue (only dialogue) , and same format in example / Output format
Fix typos
Improve phrasing: ask GPT to rephrase the system/user for clear/concise/concrete/unambiguous language.

jim · September 2, 2024, 3:55pm

This happens with English too.

For me, gpt-4o-mini with enums worked 100% of the time the first two weeks of SO but has started to hallucinate and fail values since.

Reached out to support and they said it was an issue with my code and to come here for help.

Topic		Replies	Views
Categorizing User Prompts Prompting chatgpt , api	11	2290	September 13, 2023
Quality of response between gpt-4-1106-preview and gpt-4o API gpt-4 , openai , gpt-4o	14	265	September 11, 2024
ChatCompletion GPT4 API Error - Message 0? API gpt-4 , api	12	2425	December 18, 2023
Determining what a user is asking about from a numbered list Prompting api , prompt	8	759	January 25, 2024
Custom chatbot says that it's developed by OpenAI API gpt-4	33	1906	April 2, 2024

Gpt-4o mini consistently fails to select the correct item from enum when using japanese language

Behind the curtains

Related Topics