Support for Unicode in gpt-4-1106-preview

Hi,

We’re experimenting with gpt-4-1106-preview using function calling and discovered the following:

If the user input is non-ASCII, the output will often contain multi-level escaped Unicode characters, which is a headache when dealing with a global audience…

example: (not shown)

using prompt: (not shown)
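Since the original example isn’t shown here, a minimal Python sketch of what this double escaping can look like in the returned function-call arguments and how to undo it; the Korean string is a made-up stand-in, not the original case:

```python
import json

# Hypothetical doubly-escaped function-call arguments: the Korean text has been
# turned into \uXXXX escapes, and the backslashes themselves escaped once more.
raw_arguments = '{"query": "\\\\uc548\\\\ub155\\\\ud558\\\\uc138\\\\uc694"}'

# The first json.loads still leaves literal \uXXXX sequences in the string.
args = json.loads(raw_arguments)
print(args["query"])   # \uc548\ub155\ud558\uc138\uc694

# A second decoding pass resolves the remaining escapes into real Unicode.
decoded = args["query"].encode("latin-1", "backslashreplace").decode("unicode_escape")
print(decoded)         # 안녕하세요
```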

5 Likes

I have exactly the same problem.

I think it’s similar to the problem below, which @atty-openai replied to:

/t/gpt-4-1106-preview-messes-up-function-call-parameters-encoding/478500/2

It works correctly once I translate all Korean characters into the English alphabet and remove all line-feed characters.
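For the line-feed part of that workaround, a minimal sketch of sanitizing the user input before it goes into the prompt (the Korean-to-English translation step would need a separate call and isn’t shown):

```python
# Collapse line feeds (and other runs of whitespace) in the user input before
# it is placed into the function-calling prompt. Translation is not handled here.
def strip_line_feeds(user_input: str) -> str:
    return " ".join(user_input.split())

print(strip_line_feeds("안녕하세요\n반갑습니다"))   # 안녕하세요 반갑습니다
```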

Please report this, @atty-openai. Thanks!

2 Likes

I am experiencing similar issues with output that mixes Korean and English.
In the output below, ' g c g e e c h e w a s n e w .' is supposed to be Korean writing.

The student translated ' g c g e e c h e w a s n e w .' as
'They shared their experience.' 

Originally, my prompt surrounded user input in <<brackets like this>>, which caused significantly worse output.

1 Like

The problem is fixed in the latest update.

I am also currently encountering a similar issue: the output repeats \\td0a5 infinitely.

The input language is usually Korean.

This issue is with the gpt-4-1106-preview model and has been fixed in models after gpt-4-0125-preview.

Please refer to this for more information.

Is this bug fixed? I am seeing problems with gpt-4o-2024-08-06. It encodes accented characters as Unicode escape sequences, e.g., de programaci\u00f3n.

My request expects a response that follows a JSON schema, which works, but some of the text fields in it are being escaped.
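For what it’s worth, if the escaping is only single-level, the payload is still valid JSON and a normal parse restores the characters. A minimal sketch with a made-up payload:

```python
import json

# Made-up structured-output payload: the accent arrives as a \u00f3 escape.
response_text = '{"title": "Curso de programaci\\u00f3n"}'

# \uXXXX escapes are legal JSON, so parsing the field restores the accent.
data = json.loads(response_text)
print(data["title"])   # Curso de programación
```

The escapes only become a real problem if the raw text is used without going through a JSON parser, or if they get escaped a second time as in the original report.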

Also seeing this with gpt-4o, where < and % in SQL code blocks are being escaped as \u003c and \u0025.

Are you running them through a JSONifier?

  • JSON Generation: When JSON is generated, certain characters are escaped to ensure the resulting string is a valid JSON. For example, the < character might be escaped as \u003c and % as \u0025. This is standard behavior in many JSON libraries to prevent potential security issues like XSS attacks.
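As an illustration of that escaping happening outside the model, Python’s own json module does the non-ASCII part by default; whether characters like < and % get escaped depends on the particular library or framework:

```python
import json

# json.dumps escapes non-ASCII characters to \uXXXX by default (ensure_ascii=True),
# which is one way such escapes can be introduced downstream of the model.
print(json.dumps({"lang": "programación"}))
# {"lang": "programaci\u00f3n"}

# ensure_ascii=False keeps the characters literal.
print(json.dumps({"lang": "programación"}, ensure_ascii=False))
# {"lang": "programación"}
```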

I don’t believe so, since the behavior is nondeterministic (it only does this sometimes, with certain SQL queries), which is why I suspect it might be a quirk of the model.

You can run the strings through Python’s codecs.decode (or similar), which will attempt to decode the escape sequences. The AI may simply want to write escapes. You can see whether they can be prevented by demoting \ and \u via the logit_bias parameter (after using a tokenizer to get the cl100k_base token numbers), or by a similar instruction saying what should be avoided in the output.
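A rough sketch of both suggestions; the bias value and which fragments are worth demoting are guesses that would need tuning against your own outputs:

```python
import codecs
import tiktoken

# Decoding escapes the model already produced (works when the text is ASCII
# apart from the \uXXXX escapes; unicode_escape can mangle raw non-ASCII input).
s = "de programaci\\u00f3n"
print(codecs.decode(s, "unicode_escape"))   # de programación

# Discouraging the escapes at generation time: look up cl100k_base token ids for
# backslash fragments and demote them via logit_bias on the request.
enc = tiktoken.get_encoding("cl100k_base")
logit_bias = {}
for fragment in ["\\", "\\u"]:
    for token_id in enc.encode(fragment):
        logit_bias[token_id] = -50   # strong negative bias, not a hard ban

print(logit_bias)   # pass this dict as the logit_bias parameter of the request
```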