I’ve been experiencing some quirky issues with the latest GPT-4-1106-preview and GPT-3.5-turbo-1106 models, particularly in their interaction with a custom function we’ve developed in our application. This issue seems to extend beyond just Chinese names and may involve other elements in Chinese language processing.
In our application, the GPT models are utilized to process user queries and interact with a specific function we’ve implemented. This function requires accurate input of certain elements from the user queries, such as names, for successful execution. For instance, in a query like “帮我比较一下葛兰和张坤” (“Help me compare Ge Lan and Zhang Kun”), the function relies on the model to correctly identify and input “葛兰” (Ge Lan) and “张坤” (Zhang Kun) into designated fields within the function.
However, we have noticed a recurring issue where the models incorrectly substitute these names, replacing “葛兰” (Ge Lan) with variations like “蒋兰” (Jiang Lan) or “莹兰” (Ying Lan), and “张坤” (Zhang Kun) with “张均” (Zhang Jun). A similar problem occurs with queries such as “帮我比较一下万民远和赵蓓吧” (“Help me compare Wan Minyuan and Zhao Bei”), where “赵蓓” (Zhao Bei) gets incorrectly altered to “赵蓝” (Zhao Lan) or “赵贝” (a homophone of Zhao Bei written with a different character).
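To make the failure concrete for non-Chinese readers, here is a small illustration of what the function-call arguments should contain versus what the model sometimes returns (the names are taken from the examples above; the payload arrives as a JSON string in function_call.arguments, shown here as Python dicts):

import json

# What we expect the model to extract from "帮我比较一下葛兰和张坤":
expected = {"manager_name": "葛兰"}

# What gpt-4-1106-preview sometimes returns instead:
observed_variants = [
    {"manager_name": "蒋兰"},  # Jiang Lan
    {"manager_name": "莹兰"},  # Ying Lan
]

# The arguments arrive as a JSON string on the function_call object:
arguments = json.loads('{"manager_name": "蒋兰"}')
print(arguments["manager_name"] == expected["manager_name"])  # False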
This pattern of inaccuracies raises a concern that it might not be limited to names but could extend to other aspects of Chinese language processing. The issue is particularly puzzling as it was not observed in previous versions of GPT, whether provided by OpenAI or Azure, including GPT-3.5 and GPT-4. Interestingly, when replicating these scenarios directly in GPTs, the models correctly identify and use the names mentioned in the user queries.
Is this something to do with the models’ training on non-English languages, or could it be an issue with how they’re interacting with our function through API calls? Would love to get some thoughts or advice on this. Maybe it’s something obvious I’m missing?
After conducting further tests, I’ve discovered that this misinterpretation issue also occurs in GPTs. Additionally, after deploying the 1106 model on Azure, I ran the same tests and found that the issue still persists and occurs very frequently, not only on OpenAI (OAI) but also on Azure OpenAI (AOAI). I’ve posted example code below for reproduction purposes.
Code:
from openai import AzureOpenAI
import json
import os

client = AzureOpenAI(
    api_key=os.getenv("AZURE_OPENAI_KEY"),
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    api_version="2023-07-01-preview",
)

def run_conversation(query):
    # Step 1: send the conversation and available functions to GPT
    messages = [{"role": "user", "content": query}]
    functions = [
        {
            "name": "get_information_of_the_manager",
            "description": "Enquire fund manager info given the manager name",
            "parameters": {
                "type": "object",
                "properties": {
                    "manager_name": {
                        "type": "string",
                        "description": "fund manager name",
                    }
                },
            },
        }
    ]
    response = client.chat.completions.create(
        model="gpt4-1106",  # Azure deployment name for the 1106 model
        messages=messages,
        functions=functions,
        function_call="auto",  # auto is default, but we'll be explicit
    )
    response_message = response.choices[0].message
    # Step 2: check if GPT wanted to call a function
    if response_message.function_call:
        # function_call.arguments is a JSON string; parse it into a dict
        arguments = json.loads(response_message.function_call.arguments)
        print(arguments)

query = "帮我比较一下赵蓓和万民远"  # "Help me compare Zhao Bei and Wan Minyuan"
for _ in range(3):
    run_conversation(query)
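One caveat for anyone reproducing this: with function_call="auto" the model may occasionally answer in plain text instead of calling the function. If you want every loop iteration to exercise the argument extraction, you can force the call by name, which is a standard option of the functions API:

response = client.chat.completions.create(
    model="gpt4-1106",
    messages=messages,
    functions=functions,
    # Force this specific function to be called on every request
    function_call={"name": "get_information_of_the_manager"},
)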
I’ve found the newest model to have pretty high perplexity; it is noticeably less certain about its word choices. The symptom can be brought out by raising the temperature: where the original GPT-4 would still produce coherent output, this model quickly devolves into nonsense.
I see in your API calls that you don’t specify any sampling parameters. Assistants are worse because they don’t allow adjusting them at all.
So add these to your chat completion call as additional parameters alongside “model” and see if the results don’t improve:
top_p=0.5, temperature=0.5
That puts restrictions on the diversity of tokens that are allowed to be chosen, especially in the case where a low-probability token isn’t just a creative choice but is simply “the wrong answer”.
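Applied to your script, the call would look like this (only the two sampling parameters are new; everything else is unchanged from your code):

response = client.chat.completions.create(
    model="gpt-4-1106-preview",
    messages=messages,
    functions=functions,
    function_call="auto",
    # Tighter sampling: cuts off low-probability tokens that are not
    # creative alternatives but simply wrong characters
    temperature=0.5,
    top_p=0.5,
)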
This has already proven a good recommendation for rarer non-English languages with less pretraining data, where even ChatGPT’s default settings cause grammar errors that could be avoided.
Thanks a lot for your response and the inspiration! I gave your suggestion a try, setting top_p and temperature as you mentioned. Unfortunately, it didn’t quite do the trick: GPT still seems to be getting the names mixed up.
I’ve updated the code and included an error-detection feature that checks the extracted name against the expected one. This should make it easier for non-Chinese speakers to debug. Thanks again for your help!
Code:
from openai import OpenAI, AzureOpenAI
import json
import os

# client = AzureOpenAI(
#     api_key=os.getenv("AZURE_OPENAI_KEY"),
#     azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
#     api_version="2023-07-01-preview",
# )
client = OpenAI(
    api_key=os.environ.get("OPENAI_API_KEY"),
)

def run_conversation(query):
    # Step 1: send the conversation and available functions to GPT
    messages = [{"role": "user", "content": query}]
    functions = [
        {
            "name": "get_information_of_the_manager",
            "description": "Enquire fund manager info given the manager name",
            "parameters": {
                "type": "object",
                "properties": {
                    "manager_name": {
                        "type": "string",
                        "description": "fund manager name",
                    }
                },
            },
        }
    ]
    response = client.chat.completions.create(
        model="gpt-4-1106-preview",
        messages=messages,
        functions=functions,
        function_call="auto",
    )
    response_message = response.choices[0].message
    # Step 2: check if GPT wanted to call a function
    if response_message.function_call:
        # function_call.arguments is a JSON string; parse it into a dict
        return json.loads(response_message.function_call.arguments)
    return None  # the model answered in plain text instead of calling the function

query = "帮我比较一下赵蓓和万民远"  # "Help me compare Zhao Bei and Wan Minyuan"
for _ in range(5):
    arguments = run_conversation(query)
    if arguments is None:
        print("No function call was made.")
    elif arguments["manager_name"] == "赵蓓":
        # The check expects 赵蓓, the first name in the query
        print(arguments, f'({arguments["manager_name"]}, 赵蓓) Correct!')
    else:
        print(arguments, f'({arguments["manager_name"]}, 赵蓓) Incorrect!')