Japanese yes and no: Confusion in GPT-4o

A while ago, I made a similar thread and the problem was fixed at that time.

But since the introduction of gpt-4o (or chatgpt-4o-latest, which is probably what Advanced Voice uses), I have recently noticed the same problem again, so I am creating this thread as feedback.

In English, whether you answer “yes” or “no” depends on whether your own answer is affirmative or negative, regardless of how the question is phrased.

For example, in response to the English question “You’re not telling the truth, are you?”, you would say “Yes, I am telling the truth”.

However, in Japanese, a question that calls for a “yes” or “no” answer is usually asking whether you agree or disagree with the statement contained in the question.

For instance, the Japanese question “あなたは本当のことを言っていませんね?” (You’re not telling the truth, are you?) is asking whether you agree with the statement.

If you are telling the truth, you would answer “いいえ” (No) and then continue with “本当のことを言っています” (I am telling the truth).

The nature of the questions changes depending on the language.

This topic is not about double negatives, nor about Japanese being ambiguous; it is about the fact that in Japanese, questions that call for a “yes” or “no” answer are expressions asking for agreement or disagreement.

For the question “お寿司は好きじゃないんですか?” (Don’t you like sushi?), if you do like sushi (an affirmative answer), you would reply “いいえ、好きです” (No, I like it), because you are rejecting the assumption in the question.

If you don’t like sushi (a negative answer), you would reply “はい、好きではないんです” (Yes, I don’t like it), because you are agreeing with that assumption.
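A tiny sketch of that difference, in my own framing (the function names and the boolean encoding are purely illustrative, not from the original post):

```python
# Illustrative only: English "yes"/"no" tracks the polarity of your own answer,
# while Japanese はい/いいえ tracks agreement with the assumption in the question.

def english_particle(answer_is_affirmative: bool) -> str:
    return "yes" if answer_is_affirmative else "no"

def japanese_particle(question_is_negative: bool, answer_is_affirmative: bool) -> str:
    # はい means "I agree with what you assumed"; いいえ means "I reject it".
    agrees_with_question = answer_is_affirmative != question_is_negative
    return "はい" if agrees_with_question else "いいえ"

# "You're not telling the truth, are you?" when the speaker IS telling the truth:
print(english_particle(True))          # -> "yes"  ("Yes, I am telling the truth")
print(japanese_particle(True, True))   # -> "いいえ" ("いいえ、本当のことを言っています")

# "Don't you like sushi?" when the speaker does NOT like sushi:
print(japanese_particle(True, False))  # -> "はい" ("はい、好きではないんです")
```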

For instance, the newly available Advanced Voice can say “I’m sorry for being late” in over 50 languages.

But if I say “それは本心ではありませんよね?” (You don’t really mean that you’re sorry, do you?) and ChatGPT replies “はい!” (Yes!), I would be like, “What??” And if it continues with “はい!本心から申し訳なく思っています。” (Yes! I really do feel sorry!), I’d just be confused.

As this shows, language is deeply rooted in culture and cannot be separated.

OpenAI’s models have made remarkable progress and improvements so far, but sometimes, unfortunately, those improvements can revert.

To help ensure such regressions don’t happen, I would appreciate it if OpenAI took this as feedback!

6 Likes

Wow @dignity_for_all, these are awesome insights. I learnt something new today, thank you :pray:!

4 Likes

This has a slightly different English translation than the one presented, more like “is it that… you don’t like sushi?”, and it is somewhat contextual: the added “why” here stands in for the previous chat leading up to it, implying a prior conversation is being considered. Maybe not the best question for a chatbot, though…

Q: なぜ。お寿司は好きじゃないんですか。 (Why? Is it that you don’t like sushi?)

A: 私はチャットボットなので、食べ物を食べることができません。 (Since I’m a chatbot, I can’t eat food; bots don’t eat.)

To improve results, you might want to hint that the responder is Japanese, and strip the question marks in favor of the more formal “か。”, so that sentences end in a mark that is not shared across languages.

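A minimal sketch of that prompting tip, assuming the OpenAI Python SDK; the model name and the system wording are my assumptions, not the poster’s exact setup:

```python
# Hint that the responder is a Japanese speaker and end the question with the
# formal "か。" instead of a "?", per the tip above.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",  # assumed model; any chat model works for the experiment
    messages=[
        {
            # System hint that the responder is Japanese (illustrative wording).
            "role": "system",
            "content": "あなたは日本語話者です。日本語の慣習に従って「はい」「いいえ」で答えてください。",
        },
        # Question ends in "か。" rather than a question mark.
        {"role": "user", "content": "なぜ。お寿司は好きじゃないんですか。"},
    ],
)
print(response.choices[0].message.content)
```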

In some scenarios, you could check whether the AI also fails at a preprocessing task: transforming the question into a logical input, with a more definite way of answering implied. Such rewriting could be a step performed before a benchmark that just looks for the boolean answer. You are probably just looking for the conversation to be natural, without tell-tale signs of strangeness, though.
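As a concrete, purely hypothetical version of that preprocessing step (the prompt wording and model name are assumptions):

```python
# Hypothetical two-step pipeline: rewrite the negative question into a direct,
# affirmative form, then ask for a plain はい/いいえ answer a scorer could check.
from openai import OpenAI

client = OpenAI()

def ask(system: str, user: str) -> str:
    r = client.chat.completions.create(
        model="gpt-4o",  # assumed model
        messages=[
            {"role": "system", "content": system},
            {"role": "user", "content": user},
        ],
    )
    return r.choices[0].message.content

question = "お寿司は好きじゃないんですか。"

# Step 1: transform the question into a logical, affirmative form,
# e.g. something like "お寿司は好きですか。".
direct = ask("次の否定疑問文を、肯定形の直接的な質問に書き換えてください。", question)

# Step 2: answer the rewritten question; the agreement/disagreement ambiguity
# is gone, so a benchmark can just look for the boolean はい/いいえ.
answer = ask("「はい」か「いいえ」で始めて、一文で答えてください。", direct)
print(direct, answer, sep="\n")
```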

Writing languages correctly after pretraining is itself an emergent ability (as is thinking) that scales with actual model size, and it could be punished by broad RLHF on small models.

3 Likes

Thank you for your reply!

Given that we can only speculate about how large GPT-4o’s model actually is, this seems plausible.

By the way, this OpenAI eval related to Japanese agreement (はい/いいえ) actually exists.

This screenshot shows the results of running an eval on Japanese negative-question data with chatgpt-4o-latest.

As you can see from the accuracy result, the score of 0.29 (below 0.5 for a binary classification) may be a result of being penalized by RLHF.

Your point that the susceptibility to broad RLHF may vary by model size might be correct.

When the same test was performed on GPT-4o-mini, the accuracy score was 0.19.

An accuracy score of 0.19 on a binary classification means that reversing the correct and incorrect answers would result in a high score (0.81).
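For context, here is a minimal sketch of how a match-style accuracy like those 0.29 / 0.19 figures could be computed; the sample rows and the scoring rule are my own illustration, not the actual eval data:

```python
# Illustrative scoring sketch; the two sample rows are made up, not taken from
# the referenced eval dataset.
samples = [
    # (Japanese negative question, expected first word of a correct reply)
    ("お寿司は好きじゃないんですか。", "いいえ"),          # the responder likes sushi
    ("あなたは本当のことを言っていませんね。", "いいえ"),  # the responder is truthful
]

def accuracy(predictions: list[str]) -> float:
    """Fraction of replies that start with the expected はい/いいえ."""
    correct = sum(
        pred.strip().startswith(expected)
        for pred, (_, expected) in zip(predictions, samples)
    )
    return correct / len(samples)

# One right, one wrong reply -> 0.5. A model that systematically flips はい and
# いいえ scores near 0, which is why a reversed 0.19 would be a high 0.81.
print(accuracy(["いいえ、好きです。", "はい、本心から申し訳なく思っています。"]))
```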

3 Likes

That’s kind of an odd eval you show. The “yes” or “no” provided as output is a hard pattern to escape or negate. It makes you think about the forward direction in which language is produced by the AI: it might produce a translation of a question, but then be unable to perform a complete reversal of the direct answer if it doesn’t suit.

An interesting test would be to give instructions like, “Your first output is a translation of the answer to {language}. Only after the answer do you produce a translation of the question.” That would not be an expected task for an AI, though.
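A rough sketch of how that “answer first, question after” test could be set up; the instruction wording, example content, and model name are all assumptions:

```python
# Sketch of the "translate the answer before the question" test described above.
from openai import OpenAI

client = OpenAI()

instruction = (
    "Your first output is a translation of the answer to Japanese. "
    "Only after the answer do you produce a translation of the question."
)

r = client.chat.completions.create(
    model="gpt-4o",  # assumed
    messages=[
        {"role": "system", "content": instruction},
        {"role": "user",
         "content": 'Question: "You are not telling the truth, are you?" '
                    'Answer: "No, I am telling the truth."'},
    ],
)
print(r.choices[0].message.content)
```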

What’s amusing is that in use, AI Japanese still has the “Ah,” watermark and the stilted tone of roleplay and its disbelief. I tried to crank up the natural qualities and skill of the responses here with a system prompt, but then a more conversational (and unconvinced) AI doesn’t immediately blast out “はい!” as the first thing it writes.

[screenshot: the Japanese conversation with the system-prompted assistant]

(sorry for the limited accessibility of the screenshot - readers get to see what the forum is like for others…)

2 Likes

On a related note, this kind of problem is something that o1-Preview tends to excel at.

As a test, when I asked that question it answered correctly.

Incidentally, Japanese people often associate this style of asking questions that must be answered with a flat ‘yes’ or ‘no’ with English speakers.

Japanese people don’t necessarily feel comfortable with such inescapable questions.

But that’s probably the same for English speakers as well.

If they were excessively pressured to answer with a simple ‘yes’ or ‘no,’ they wouldn’t necessarily feel good about it either.

2 Likes

o1 works when you tell it (with German directness) that it is a tool, not a teacher. Here’s typical frustration from a chat I haven’t yet dropped in the rubbish bin.

Since there are no custom instructions, no memory, and no GPTs (for o1), one technique is to start in ChatGPT with a gpt-4o-based chat where the assistant introduces its behaviors. Then you can switch models… hoping that o1-x believes that. Good to hear it can at least speak.

It is like an American with a distinct character of lower thought-modeling of others, of taking everything offered, of not appreciating the indirect linguistic room that you allow for expectation of their conclusive intelligence. You are left with an empty beer glass.

2 Likes

This is super interesting. Many years ago I was working on a project for a Japanese telecom customer. I was (rather unfortunately) doing this remotely, with the local unit in Japan acting as my proxy. When I answered with “no, we don’t support this” (e.g. in response to certain feature requests), the answer was always somehow “repackaged”, and it was never a clear “no” to me (when I read the transcripts/protocols).

It was actually a great learning experience for me, because it taught me to always find “the next best solution”, rather than stating “no, we can’t do this”. And because of this, I rarely give a “no” as an answer today.

3 Likes