Support for Unicode in gpt-4-1106-preview

Hi,

We’re experimenting with gpt-4-1106-preview using function calling and discovered the following:

If the user input is non-ASCII, the output will often contain multi-level escaped Unicode characters, which is a headache when dealing with a global audience…

example: (not shown)

using prompt: (not shown)
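Since the original example isn’t shown here, a minimal Python sketch of what this double escaping can look like in the returned function-call arguments and how to undo it; the Korean string is a made-up stand-in, not the original case:

```python
import json

# Hypothetical doubly-escaped function-call arguments: the Korean text has been
# turned into \uXXXX escapes, and the backslashes themselves escaped once more.
raw_arguments = '{"query": "\\\\uc548\\\\ub155\\\\ud558\\\\uc138\\\\uc694"}'

# The first json.loads still leaves literal \uXXXX sequences in the string.
args = json.loads(raw_arguments)
print(args["query"])   # \uc548\ub155\ud558\uc138\uc694

# A second decoding pass resolves the remaining escapes into real Unicode.
decoded = args["query"].encode("latin-1", "backslashreplace").decode("unicode_escape")
print(decoded)         # 안녕하세요
```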

5 Likes

I have exactly the same problem.

I think it’s similar to the problem below, which @atty-openai replied to:

/t/gpt-4-1106-preview-messes-up-function-call-parameters-encoding/478500/2

It works correctly once I translate all Korean characters into the English alphabet and remove all line-feed characters.
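For the line-feed part of that workaround, a minimal sketch of sanitizing the user input before it goes into the prompt (the Korean-to-English translation step would need a separate call and isn’t shown):

```python
# Collapse line feeds (and other runs of whitespace) in the user input before
# it is placed into the function-calling prompt. Translation is not handled here.
def strip_line_feeds(user_input: str) -> str:
    return " ".join(user_input.split())

print(strip_line_feeds("안녕하세요\n반갑습니다"))   # 안녕하세요 반갑습니다
```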

Please report this, @atty-openai. Thanks!

2 Likes

I am experiencing similar issues with output that mixes Korean and English.
In the output below, ' g c g e e c h e w a s n e w .' is supposed to be Korean writing.

The student translated ' g c g e e c h e w a s n e w .' as
'They shared their experience.' 

Originally, my prompt surrounded user input in <<brackets like this>>, which caused significantly worse output.

1 Like

The problem is fixed in the latest update.

I am also currently encountering a similar issue: the output repeats \\td0a5 infinitely.

The input language is usually Korean.

This issue is with the gpt-4-1106-preview model and has been fixed in models after gpt-4-0125-preview.

Please refer to this for more information.

Is this bug fixed? I am seeing problems with gpt-4o-2024-08-06. It encodes accented characters as Unicode escape sequences, e.g., de programaci\u00f3n.

My request expects a response that follows a JSON schema, which works, but some of the text fields in it are being escaped.
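For what it’s worth, if the escaping is only single-level, the payload is still valid JSON and a normal parse restores the characters. A minimal sketch with a made-up payload:

```python
import json

# Made-up structured-output payload: the accent arrives as a \u00f3 escape.
response_text = '{"title": "Curso de programaci\\u00f3n"}'

# \uXXXX escapes are legal JSON, so parsing the field restores the accent.
data = json.loads(response_text)
print(data["title"])   # Curso de programación
```

The escapes only become a real problem if the raw text is used without going through a JSON parser, or if they get escaped a second time as in the original report.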

Also seeing this with gpt-4o, where < and % in SQL code blocks are being escaped as \u003c and \u0025.

Are you running them through a JSONifier?

  • JSON Generation: When JSON is generated, certain characters are escaped to ensure the resulting string is a valid JSON. For example, the < character might be escaped as \u003c and % as \u0025. This is standard behavior in many JSON libraries to prevent potential security issues like XSS attacks.
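As an illustration of that escaping happening outside the model, Python’s own json module does the non-ASCII part by default; whether characters like < and % get escaped depends on the particular library or framework:

```python
import json

# json.dumps escapes non-ASCII characters to \uXXXX by default (ensure_ascii=True),
# which is one way such escapes can be introduced downstream of the model.
print(json.dumps({"lang": "programación"}))
# {"lang": "programaci\u00f3n"}

# ensure_ascii=False keeps the characters literal.
print(json.dumps({"lang": "programación"}, ensure_ascii=False))
# {"lang": "programación"}
```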

I don’t believe so, since the behavior is nondeterministic (it only does this sometimes, with certain SQL queries), which is why I suspect it might be a quirk of the model.

You can run the strings through Python’s codecs.decode (or similar), which will attempt to decode the escape sequences. The AI may simply want to write escapes. You can see whether they can be prevented by demoting \ and \u via the logit_bias parameter (after using a tokenizer to get the cl100k_base token numbers), or by a similar instruction saying what should be avoided in the output.
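A rough sketch of both suggestions; the bias value and which fragments are worth demoting are guesses that would need tuning against your own outputs:

```python
import codecs
import tiktoken

# Decoding escapes the model already produced (works when the text is ASCII
# apart from the \uXXXX escapes; unicode_escape can mangle raw non-ASCII input).
s = "de programaci\\u00f3n"
print(codecs.decode(s, "unicode_escape"))   # de programación

# Discouraging the escapes at generation time: look up cl100k_base token ids for
# backslash fragments and demote them via logit_bias on the request.
enc = tiktoken.get_encoding("cl100k_base")
logit_bias = {}
for fragment in ["\\", "\\u"]:
    for token_id in enc.encode(fragment):
        logit_bias[token_id] = -50   # strong negative bias, not a hard ban

print(logit_bias)   # pass this dict as the logit_bias parameter of the request
```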