gpt-4-1106-preview is not generating UTF-8

I’m getting a lot of generations from gpt-4-1106-preview that are not correctly encoded in UTF-8. Example (in Portuguese):

“tapetes de entrada com padrÃµes geomÃ©tricos”

This corresponds to the following bytes:

b'tapetes de entrada com padr\xc3\x83\xc2\xb5es geom\xc3\x83\xc2\xa9tricos'

Is anyone else experiencing these outputs?
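For anyone trying to reproduce the bytes above, here is a minimal Python sketch. It assumes the corruption is a single round of "UTF-8 bytes read as Latin-1, then encoded as UTF-8 again"; that is only one plausible explanation for the observed bytes, not a confirmed account of what happens on the API side. The variable names are illustrative.

```python
# The text the model was presumably meant to produce (assumption).
intended = "tapetes de entrada com padrões geométricos"

# Step 1: the correct UTF-8 encoding of that text.
utf8_bytes = intended.encode("utf-8")

# Step 2: those bytes are wrongly interpreted as Latin-1 (ISO-8859-1),
# so each 2-byte UTF-8 sequence turns into two separate characters.
garbled = utf8_bytes.decode("latin-1")

# Step 3: the garbled text is encoded as UTF-8 a second time.
double_encoded = garbled.encode("utf-8")

print(double_encoded)
# b'tapetes de entrada com padr\xc3\x83\xc2\xb5es geom\xc3\x83\xc2\xa9tricos'
```

The result matches the byte string in the post, which suggests the output is valid UTF-8 carrying text that was mis-decoded once before it was encoded.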

I’ve noticed the same. It also occurs with function calling and when producing JSON.

This is a known bug. We have implemented a fix, but it hasn’t rolled out yet. It will be fixed shortly, but until then, unfortunately, you will need to identify and handle this encoding issue yourself.

@owencmoore please see #478500

This is not fixed at all, and the problem is misidentified. The model is not actually “not returning UTF-8”; rather, it is corrupting the strings, which makes any “fix it ourselves” logic impossible.

So far, I’ve experienced the following since Dec-2 2024 (Europe/Stockholm)

For tool function calls with gpt-3.5-turbo, the responses (function calls) contain sequences like Ã¤ and Ã¶ in place of ä and ö. These are typical artifacts of an encoding mismatch in which UTF-8 bytes are decoded as ISO-8859-1 (Latin-1) or Windows-1252. This can be fixed by encoding the string as Latin-1 and decoding the resulting bytes as UTF-8.

For tool function calls with gpt-4-1106-preview, the responses (function calls) omit the non-ASCII characters altogether (hard to fix).

So far, I’ve only seen these issues when using functions as tools.
NB: I stopped using non-tool function calls when tools came out.
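For the gpt-3.5-turbo case, the Latin-1 round trip can be applied to the parsed function-call arguments rather than to the raw JSON text. The following is only a sketch: `raw_arguments` is a made-up arguments string, and `repair_mojibake` and `repair_tree` are hypothetical helper names, not part of any SDK.

```python
import json


def repair_mojibake(text):
    """Undo one round of 'UTF-8 read as Latin-1', if the text allows it."""
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in Latin-1, or not valid UTF-8 afterwards:
        # assume the string was fine and leave it untouched.
        return text


def repair_tree(value):
    """Apply repair_mojibake to every string in a parsed JSON value."""
    if isinstance(value, str):
        return repair_mojibake(value)
    if isinstance(value, list):
        return [repair_tree(item) for item in value]
    if isinstance(value, dict):
        return {repair_tree(k): repair_tree(v) for k, v in value.items()}
    return value  # numbers, booleans, None


# Made-up example of a corrupted arguments payload (assumption).
raw_arguments = '{"city": "GÃ¶teborg", "limit": 3}'
arguments = repair_tree(json.loads(raw_arguments))
```

Parsing first and repairing each string separately means one clean string cannot block the repair of another. It does nothing for the gpt-4-1106-preview case, where the characters are missing rather than garbled.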

me too

Seems to be a new bug after the latest release.

You’re using gpt-3.5-turbo-1106. From what I have seen when using functions, the problem is that the output is UTF-8 text that has been decoded as Latin-1 (ISO-8859-1) while still claiming to be UTF-8 :laughing:

What you can try is to convert that string to bytes (the function for this typically takes an encoding parameter, which you can use to specify ISO-8859-1). Once you have the bytes, convert them back to a string, but this time as UTF-8. Problem solved.

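Here’s an example in Java:
import java.nio.charset.StandardCharsets;
// Turn the mis-decoded characters back into raw bytes, then decode those bytes as UTF-8.
byte[] bytes = inStr.getBytes(StandardCharsets.ISO_8859_1);
String outStr = new String(bytes, StandardCharsets.UTF_8);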
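The same round trip in Python, as a rough sketch. The helper name `repair_mojibake` is made up for illustration, and the guard is my own addition: it returns the input unchanged when the string cannot be the result of this particular corruption, so text that is already correct is normally left alone.

```python
def repair_mojibake(text: str) -> str:
    """Undo one round of 'UTF-8 bytes decoded as Latin-1', if present."""
    try:
        # Latin-1 maps code points 0-255 one-to-one onto bytes, so this
        # recovers the original bytes, which are then decoded as UTF-8.
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Characters outside Latin-1, or bytes that are not valid UTF-8:
        # the string was probably not corrupted this way, so keep it.
        return text


fixed = repair_mojibake("padrÃµes geomÃ©tricos")
untouched = repair_mojibake("padrões geométricos")
```

Two caveats: a string that legitimately contains a sequence such as Ã followed by © would be altered, and if the mis-decoding used Windows-1252 rather than Latin-1, the encoding name would need to change accordingly.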

Regarding gpt-4 (the problem seems different there, but it also involves function calls):

I’m trying to replace Unicode characters with their Unicode representation using a library (see List of Unicode characters - Wikipedia), to see whether GPT-4 function calls interpret the input better than what I have now.

The result will depend on how GPT handles such a representation, but my hope is that the characters are either left as-is or correctly interpreted.
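One possible way to produce such a representation, assuming the `\uXXXX` escape form is what is meant, is Python's standard `json` module with `ensure_ascii=True`. This is only a sketch of the escaping step; whether the model handles the escaped input any better is untested here, and the sample word is made up.

```python
import json

# Sample input with non-ASCII characters (illustrative only).
original = "blåbär"

# ensure_ascii=True replaces every non-ASCII character with a \uXXXX
# escape, so the resulting JSON text contains ASCII only.
escaped = json.dumps(original, ensure_ascii=True)

# Parsing the JSON text again restores the original characters.
restored = json.loads(escaped)

print(escaped)
```

Because the escaped text is plain ASCII, a Latin-1 versus UTF-8 mix-up cannot change it in transit.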

Hey all, as of January 25th 2024, we have resolved this bug with the latest model releases: New embedding models and API updates

Thank you for being patient with us as we worked to address this at the model level.
