This problem isn’t fixed at all. I run a chatbot on OpenAI’s API that uses function calling to search Google and Bing, and my Chinese-speaking friends use it heavily every day. This bug makes the bot’s search almost unusable.
It’s not about the hassle of decoding Unicode in JSON. The real issue is that the GPT model does not produce the correct Unicode escape sequences for less common Chinese characters, so the decoded text contains completely wrong characters. Here’s a full demo that you can run to see for yourself.
Here is the reproduction example:
import json
import os

import openai

# Reads the API key from the OPENAI_API_KEY environment variable.
client = openai.OpenAI(api_key=os.getenv('OPENAI_API_KEY'))

models = [
    'gpt-3.5-turbo-16k',
    'gpt-4-1106-preview',
]
query = '邓紫棋'

# Send the same request 10 times per model and print each tool call's arguments.
for model in models:
    for i in range(10):
        result = client.chat.completions.create(
            model=model,
            messages=[
                {'role': 'system', 'content': 'You are a helpful assistant with searching capabilities'},
                {'role': 'user', 'content': f'Please search for "{query}"'}
            ],
            tools=[{
                "type": "function",
                "function": {
                    "name": "search",
                    "description": "Search on Google",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "query": {
                                "type": "string",
                                "description": "The search query",
                            },
                        },
                        "required": ["query"],
                    }
                }
            }],
        )
        # Raw JSON string produced by the model, before any decoding.
        # Note: tool_calls is None if the model answers without calling the tool.
        arguments = result.choices[0].message.tool_calls[0].function.arguments
        decoded_arguments = json.loads(arguments)
        print(model, i + 1, repr(arguments), '-->', decoded_arguments)
And this is what the program’s output looks like (note that the older model, gpt-3.5-turbo-16k, works fine, but the newer model, gpt-4-1106-preview, often spits out the wrong Chinese characters):
gpt-3.5-turbo-16k 1 '{\n "query": "邓紫棋"\n}' --> {'query': '邓紫棋'}
gpt-3.5-turbo-16k 2 '{\n "query": "邓紫棋"\n}' --> {'query': '邓紫棋'}
gpt-3.5-turbo-16k 3 '{\n "query": "邓紫棋"\n}' --> {'query': '邓紫棋'}
gpt-3.5-turbo-16k 4 '{\n "query": "邓紫棋"\n}' --> {'query': '邓紫棋'}
gpt-3.5-turbo-16k 5 '{\n"query": "邓紫棋"\n}' --> {'query': '邓紫棋'}
gpt-3.5-turbo-16k 6 '{\n "query": "邓紫棋"\n}' --> {'query': '邓紫棋'}
gpt-3.5-turbo-16k 7 '{\n"query": "邓紫棋"\n}' --> {'query': '邓紫棋'}
gpt-3.5-turbo-16k 8 '{\n "query": "邓紫棋"\n}' --> {'query': '邓紫棋'}
gpt-3.5-turbo-16k 9 '{\n "query": "邓紫棋"\n}' --> {'query': '邓紫棋'}
gpt-3.5-turbo-16k 10 '{\n "query": "邓紫棋"\n}' --> {'query': '邓紫棋'}
gpt-4-1106-preview 1 '{"query":"\\u90a3\\u7d2b\\u68a8"}' --> {'query': '那紫梨'}
gpt-4-1106-preview 2 '{"query":"\\u9093\\u7d2b\\u68a8"}' --> {'query': '邓紫梨'}
gpt-4-1106-preview 3 '{"query":"\\u90a3\\u7d2b\\u68a8"}' --> {'query': '那紫梨'}
gpt-4-1106-preview 4 '{"query":"\\u9093\\u7d2b\\u68cb"}' --> {'query': '邓紫棋'}
gpt-4-1106-preview 5 '{"query":"\\u90a2\\u7d2b\\u68a8"}' --> {'query': '邢紫梨'}
gpt-4-1106-preview 6 '{"query":"\\u9093\\u7d2b\\u68a8"}' --> {'query': '邓紫梨'}
gpt-4-1106-preview 7 '{"query":"\\u90a3\\u7d2b\\u68a8"}' --> {'query': '那紫梨'}
gpt-4-1106-preview 8 '{"query":"\\u90a3\\u7d2b\\u68a8"}' --> {'query': '那紫梨'}
gpt-4-1106-preview 9 '{"query":"\\u90a3\\u7d2b\\u68a8"}' --> {'query': '那紫梨'}
gpt-4-1106-preview 10 '{"query":"\\u90a2\\u7d2b\\u68cb"}' --> {'query': '邢紫棋'}
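To make it clear that the JSON decoding step is not at fault, here is a small offline check (no API call needed) that compares code points. It is only a sketch: the `codepoints` helper is my own name, and the raw string is copied from run 1 of gpt-4-1106-preview above.

```python
import json


def codepoints(text):
    """Return the code point of each character, formatted as U+XXXX."""
    return [f"U+{ord(ch):04X}" for ch in text]


expected = "邓紫棋"
# Raw arguments string copied from run 1 of gpt-4-1106-preview above.
raw_arguments = '{"query":"\\u90a3\\u7d2b\\u68a8"}'
decoded = json.loads(raw_arguments)["query"]

# Positions where the decoded character differs from the requested one.
mismatches = [i for i, (a, b) in enumerate(zip(expected, decoded)) if a != b]

print(codepoints(expected))  # ['U+9093', 'U+7D2B', 'U+68CB']
print(codepoints(decoded))   # ['U+90A3', 'U+7D2B', 'U+68A8']
print(mismatches)            # [0, 2]
```

`json.loads` faithfully returns the characters named by the escape sequences. The escapes themselves already point at the wrong code points (U+90A3 instead of U+9093, U+68A8 instead of U+68CB), so the corruption happens on the model side.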
Another example, this time with query = '新型冠状病毒疫情':
gpt-3.5-turbo-16k 1 '{\n "query": "新型冠状病毒疫情"\n}' --> {'query': '新型冠状病毒疫情'}
gpt-3.5-turbo-16k 2 '{\n "query": "新型冠状病毒疫情"\n}' --> {'query': '新型冠状病毒疫情'}
gpt-3.5-turbo-16k 3 '{\n "query": "新型冠状病毒疫情"\n}' --> {'query': '新型冠状病毒疫情'}
gpt-3.5-turbo-16k 4 '{\n "query": "新型冠状病毒疫情"\n}' --> {'query': '新型冠状病毒疫情'}
gpt-3.5-turbo-16k 5 '{\n "query": "新型冠状病毒疫情"\n}' --> {'query': '新型冠状病毒疫情'}
gpt-3.5-turbo-16k 6 '{\n "query": "新型冠状病毒疫情"\n}' --> {'query': '新型冠状病毒疫情'}
gpt-3.5-turbo-16k 7 '{\n"query": "新型冠状病毒疫情"\n}' --> {'query': '新型冠状病毒疫情'}
gpt-3.5-turbo-16k 8 '{\n "query": "新型冠状病毒疫情"\n}' --> {'query': '新型冠状病毒疫情'}
gpt-3.5-turbo-16k 9 '{\n "query": "新型冠状病毒疫情"\n}' --> {'query': '新型冠状病毒疫情'}
gpt-3.5-turbo-16k 10 '{\n "query": "新型冠状病毒疫情"\n}' --> {'query': '新型冠状病毒疫情'}
gpt-4-1106-preview 1 '{"query":"\\u65b0\\u578b\\u51a0\\u72b6\\u75c5\\u6bd2\\u75ab\\u60c5"}' --> {'query': '新型冠状病毒疫情'}
gpt-4-1106-preview 2 '{"query":"\\u65b0\\u578b\\u51a0\\u72b6\\u75c5\\u6bd2\\u75ab\\u60c5"}' --> {'query': '新型冠状病毒疫情'}
gpt-4-1106-preview 3 '{"query":"\\u65b0\\u578b\\u51b7\\u51fb\\u75c5\\u6bdb\\u75ab\\u60c5"}' --> {'query': '新型冷击病毛疫情'}
gpt-4-1106-preview 4 '{"query":"\\u65b0\\u578b\\u519b\\u72b6\\u75c5\\u6bd2\\u75ab\\u60c5"}' --> {'query': '新型军状病毒疫情'}
gpt-4-1106-preview 5 '{"query":"\\u65b0\\u578b\\u5185\\u51b7\\u83ab\\u75c5\\u6bd2\\u75ab\\u60c5"}' --> {'query': '新型内冷莫病毒疫情'}
gpt-4-1106-preview 6 '{"query":"\\u65b0\\u578b\\u51a0\\u72b6\\u75c5\\u6bd2\\u75ab\\u60c5"}' --> {'query': '新型冠状病毒疫情'}
gpt-4-1106-preview 7 '{"query":"\\u65b0\\u578b\\u51a0\\u72b6\\u75c5\\u6bd2\\u75ab\\u60c5"}' --> {'query': '新型冠状病毒疫情'}
gpt-4-1106-preview 8 '{"query":"\\u65b0\\u578b\\u51a0\\u72b6\\u75c5\\u6bd2\\u75ab\\u60c5"}' --> {'query': '新型冠状病毒疫情'}
gpt-4-1106-preview 9 '{"query":"\\u65b0\\u578b\\u51a0\\u72b6\\u75c5\\u6bd2\\u75ab\\u60c5"}' --> {'query': '新型冠状病毒疫情'}
gpt-4-1106-preview 10 '{"query":"\\u65b0\\u578b\\u519b\\u72b6\\u75c5\\u6bd2\\u75ab\\u60c5"}' --> {'query': '新型军状病毒疫情'}
For those not familiar with Chinese: the first example is the name of a singer, and the second is the Simplified Chinese term for COVID-19. Both are very common words.
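If you want to put a number on the failure rate instead of eyeballing the output, something like the following should work. This is just a sketch: `count_corrupted` is a hypothetical helper of mine, and the three sample strings are copied from runs 1, 3, and 4 of gpt-4-1106-preview in the second example. It only makes sense for this repro, where the expected query is known in advance.

```python
import json


def count_corrupted(raw_arguments_list, expected_query):
    """Count raw tool-call argument strings whose decoded query differs from the expected one."""
    corrupted = 0
    for raw in raw_arguments_list:
        if json.loads(raw).get("query") != expected_query:
            corrupted += 1
    return corrupted


# Raw argument strings copied from the gpt-4-1106-preview output above.
samples = [
    '{"query":"\\u65b0\\u578b\\u51a0\\u72b6\\u75c5\\u6bd2\\u75ab\\u60c5"}',  # run 1
    '{"query":"\\u65b0\\u578b\\u51b7\\u51fb\\u75c5\\u6bdb\\u75ab\\u60c5"}',  # run 3
    '{"query":"\\u65b0\\u578b\\u519b\\u72b6\\u75c5\\u6bd2\\u75ab\\u60c5"}',  # run 4
]

print(count_corrupted(samples, "新型冠状病毒疫情"))  # 2
```

In the repro loop you could collect each `arguments` string into a list per model and pass it to this helper together with `query`.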