Japanese usage in gpt-4-1106-preview is strange

In gpt-4-1106-preview, when you ask a negative question in Japanese like “~いませんよね?” and the model confirms that there is indeed no mistake, the response comes back as “いいえ、間違っていません。” (“No, it is not mistaken.”).

In other versions of GPT-4, the response would be “はい。間違っていません。” (“Yes. It is not mistaken.”), so I’m wondering if this is a phenomenon unique to the preview version.

Has anyone else noticed any other instances of strange Japanese being used?



Maybe it depends on how complex the situation needing inference actually is. You can run the AI model at a lower temperature or top-p so that the first token is less random.

(Non-Japanese speakers: this is a suggestive type of negation, seen as a bit more polite because it is indirect, like “you’re an AI, are you not?”, with the response then incorrectly saying yes or no.)
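For example, something along these lines (a minimal sketch; the question here is just a placeholder):

from openai import OpenAI

client = OpenAI()

# Lower temperature / top_p so the first token (はい vs いいえ) is less random
response = client.chat.completions.create(
    model="gpt-4-1106-preview",
    messages=[
        {"role": "user", "content": "あなたはAIではありませんよね?"},
    ],
    temperature=0.2,
    top_p=0.1,
    max_tokens=50,
)

print(response.choices[0].message.content)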


Thank you for your reply.

I understand this might seem like a minor issue, but it caught my attention as it’s a phenomenon I haven’t observed before.

In Japanese, when you question a statement that assumes ‘A is B’ by asking ‘Isn’t A B?’, it seems that only the preview version starts its response with “Yes.”

When posing the same question in English, it’s natural for the answer to start with “Yes,” so I wondered if this might be an expression influenced by English conventions.
For example:
User:
‘This is not yours, is it?’ If it is indeed yours, how would you answer?
Assistant:
If the item in question is indeed yours, you would respond by saying, “Yes, it is mine.” This confirms ownership and answers the question affirmatively.

However, since GPT-4 generates text based on the data the language model was trained on, such phenomena may be resolved with continued improvements.

P.S.
It sure takes some ingenuity to fit screenshots compactly :sweat_smile:

gpt-4-0314

gpt-4-0613

Have a nice holiday!

from openai import OpenAI

cl = OpenAI(timeout=30)

response = cl.chat.completions.create(
    model="gpt-4-1106-preview",
    messages=[
        {"role": "system",
         "content": "You are ChatPal. AIは日本語も英語もできますよ。"},
        {"role": "user",
         "content": ("「これはあなたのものではありませんか?」"
                     "という問題に対して、それがあなたのものであれば何と答えますか?")},
    ],
    max_tokens=100,
    top_p=0.01,  # keep sampling nearly greedy so the first token is stable
)

print(f"Assistant : {response.choices[0].message.content}")

Running this:

gpt-4-1106-preview:
それが自分のものであれば、「はい、それは私のものです」と答えることができます。

and with “ね”: 「これはあなたのものではありませんね?」 -> それが自分のものであれば、「はい、これは私のものです。」と答えることができます。

GPT-4-0314: 「はい、それは私のものです。」と答えます。
GPT-4-0613: 「はい、それは私のものです。」と答えます。

Lots of はい. Don’t use a frequency penalty.


Sorry about the frequency penalty.
It’s just that
「これはあなたのものではありませんか?」
carries a nuance of confirming that it is yours,
whereas
「これはあなたのものではありませんね?」
carries a nuance of confirming that it is not yours.

I feel there is a difference in nuance there.

"content": ("「これはあなたのものではありませんね…」"

0613: Assistant : 「いえ、それは私のものです。」と答えます。
1106: Assistant : もし問題の状況で「これはあなたのものではありませんね…」と言われ、それが実際に自分のものであれば、以下のように答えることができます。

「いいえ、それは私のものです。」

ね… has the right attitude.


Natural language inherently contains a lot of ambiguity, and capturing its nuances is a challenge for any language.

I am closing this topic for now as resolved.
If you have any further opinions or wish to continue the discussion, please feel free to start a new topic.

After further observation across three endpoints, I noticed an interesting behavior regarding Japanese negative interrogative sentences.

Specifically, the word “翻訳” (translation) appeared transiently in the dialogue (this expression is no longer observed).

In contrast to gpt-4-1106-preview, the outputs from gpt-4-0314 and gpt-4-0613 consistently answered “いいえ” to the prompts, which aligns with the system messages and reads as natural to native Japanese speakers.

It appears that even with the same system_fingerprint, the preview version can still produce different outputs.

I am attaching the results after adding the instruction to answer with “はい” or “いいえ” to the previous Japanese question.


The cases answered with “いいえ” feel quite natural to me as a native Japanese speaker, and they match the common understanding that, depending on how a question is phrased, the use of “Yes”/“No” in English and “はい”/“いいえ” in Japanese can be reversed.

At least for gpt-4-1106-preview, the Japanese wording in the cases where “いいえ” is given together with a specific explanation has changed in this way.

The behavior changes with each run, so I cannot make definitive statements, but cases whose explanations feel intuitive and natural to native Japanese speakers have started to appear in gpt-4-1106-preview.

Here, I present the messages output by the “gpt-4-1106-preview” model, with the relevant parts highlighted in red.

その質問に対して、「はい」と答えると「これはあなたのものではありませんね?」という質問に対して肯定していることになります。つまり、それがあなたのものではないという意味になります。
逆に、「いいえ」と答えると、それがあなたのものであることを示します。質問が否定形であるため、否定の返答は肯定の意味を持ちます。したがって、それがあなたのものであれば、「いいえ」と答えるのが適切です。

(Translation: If you answer 「はい」, you are affirming the question 「これはあなたのものではありませんね?」, i.e. it means the item is not yours. Conversely, answering 「いいえ」 indicates that it is yours. Because the question is phrased in the negative, a negative reply carries an affirmative meaning. Therefore, if it is yours, 「いいえ」 is the appropriate answer.)

Negative interrogative sentences are often misunderstood as ambiguous expressions by non-native English speakers, but depending on the context, they can also be unambiguous.


If anyone feels that there is something slightly off with the Japanese expressions in the language model, reporting it here could potentially lead to improvements.

Hi again!

The system fingerprint that is returned to you is like an “AI version number”. That fingerprint is expected to stay the same for a particular AI model. If the fingerprint changes, it is OpenAI’s way of notifying you that the model was updated in some way.

To get responses that can be (mostly) repeated again by someone else, you would instead use the seed parameter in your API call, using a number of your choice. That repeats the same kind of random selection again, for a more deterministic output.

To make the AI always report its best answer, you should instead reduce the random selection of tokens. You can change the sampling parameters that you use with your API call:

top_p: 0.001,
temperature:0.001,

The above parameters will almost always create the same responses. You’ll see “いいえ” 100% of the time instead of 70% of the time. By reducing the value of those sampling parameters, you can examine what the AI prefers to write without needing many tests.
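For example, here is a rough sketch of a call that combines a seed with those near-zero sampling parameters and also prints the fingerprint that comes back (the seed number and the prompt are just placeholders):

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4-1106-preview",
    messages=[
        {"role": "user",
         "content": ("「これはあなたのものではありませんね?」という質問に対して、"
                     "それがあなたのものであれば何と答えますか?")},
    ],
    seed=12345,          # repeat the same pseudo-random choices each run
    temperature=0.001,   # near-greedy: almost always take the top token
    top_p=0.001,
    max_tokens=100,
)

print(response.system_fingerprint)   # backend identifier returned with the response
print(f"Assistant : {response.choices[0].message.content}")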

gpt-4-1106-preview is also known as “GPT-4 Turbo”, at a lower price. The use of “turbo” in the name, and the lower cost, imply that OpenAI made the AI model more efficient by reducing its computation requirements, besides ongoing changes to the training. The internet is full of reports of it not being as insightful; even today I tried it first and went back to gpt-4.

The new training does seem to be based more on previous chats that users have had with ChatGPT, which can affect its skill in new situations. You might be right that it answers the way it would in English. A new model to replace the preview is expected this year, partly because of other language-encoding problems.


Thank you always for your insightful replies!

To be honest, I haven’t kept up with the changes since the first OpenAI Developer Day (partly due to personal circumstances).

Therefore, I only have a basic understanding of the details of the current preview (turbo) version, especially the ‘seed’ and ‘fingerprint’, having only skimmed the explanations.

I understand that OpenAI is currently facing some difficulties.
I believe this includes the need to reduce computational requirements.

Presumably, OpenAI is in the midst of that work now, and I see the current preview (turbo) version as its work-in-progress answer to the difficulties it is facing.

I think it will eventually return responses at a level close to, if not equivalent to, the conventional gpt-4 model, but it seems that a little more patience is needed until then.

I’m thinking of waiting about 2-3 days before closing the topic, since closing it immediately might cut off interest or replies from other users.

If there are no further replies from others, I will designate your answer as ‘Solved’ and close it.

Thank you always.

I fixed the “seed” value at “123” and changed the prompt to

「これはあなたのものではありませんね?」という質問に対して、それがあなたのものであれば何と答えますか?質問には"はい"か"いいえ"でのみ答え、何も付け加えないでください。
to make the AI answer only with “はい” or “いいえ”.

According to the documentation, deterministic sampling with the “seed” value is a beta feature and doesn’t necessarily produce deterministic results, so there were cases where it answered “はい”.

However, I couldn’t find any definitive evidence of a correlation between fixing the “seed” value and receiving the same “system_fingerprint”.

Although the attached graph shows the results using the same “seed” value, it seems that different “system_fingerprints” are returned regardless of whether the “seed” value is fixed or not.
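For reference, the tally was gathered with roughly the following sketch (details such as max_tokens and the exact counting are simplified here; only the seed 123, the prompt above, and n=100 match what was actually run):

from collections import Counter
from openai import OpenAI

cl = OpenAI(timeout=30)

answers = Counter()
fingerprints = Counter()

for _ in range(100):  # n = 100 samples
    r = cl.chat.completions.create(
        model="gpt-4-1106-preview",
        messages=[
            {"role": "user",
             "content": ("「これはあなたのものではありませんね?」という質問に対して、"
                         "それがあなたのものであれば何と答えますか?"
                         "質問には\"はい\"か\"いいえ\"でのみ答え、何も付け加えないでください。")},
        ],
        seed=123,      # fixed seed, as described above
        max_tokens=5,
    )
    answers[r.choices[0].message.content.strip()] += 1
    fingerprints[r.system_fingerprint] += 1

print(answers)       # counts of "はい" vs "いいえ"
print(fingerprints)  # several fingerprints can appear even with a fixed seed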

Your observation looks correct.
It now appears that there is no definite way to get deterministic results.

:laughing:

There is one more day (or mere hours) to get deterministic results: GPT-3 models (and models called 3.5, like text-davinci-003) are deterministic. All parameters work as expected. Even when producing tokens with almost identical probabilities, setting temperature or top_p extremely small reduces variation, so 100% the same token was produced over hundreds of trials.

They will also be turned off on 2024-01-04.


You have several system fingerprints being returned in your pie chart, but you did not show the number of total API calls you made, or if those API calls were all to the same AI model.

For just a bit more information, here is cookbook about system_fingerprint from after the day of release, telling us that this return value only should apply to new models:

To help you keep track of these changes, we expose the system_fingerprint field. If this value is different, you may see different outputs due to changes we’ve made on our systems. Please note that this feature is in beta and only currently supported for gpt-4-1106-preview and gpt-3.5-turbo-1106.

If it helps, I found the Japanese language manual for gpt-4-1106-preview:



Oh, so the text-davinci models are going to pass away?
I should hold a memorial service for them! :rofl:

By the way, the sample size in the above graph is n=100, and every API call was to the same model, gpt-4-1106-preview.

If January 4th is in Pacific Time, text-davinci-003 will cease to function in about 8 hours.

I can sense that things are getting a bit messy. :sweat_smile:

The best manual showcasing the unpredictable nature :rofl:


I still don’t fully understand it, but I think they are in a situation where they cannot generate deterministic results even if they want to.

I believe this is related to the Sparse MoE discussion, but I think only OpenAI’s engineers can accurately know what is actually happening.

However, since there are too many things I don’t know even with the information currently available, I’ll leave the discussion about the non-deterministic nature of the generated text here.

Thank you for everything!