Gpt-4-1106-preview is not generating utf-8

miguelwon · November 8, 2023, 12:02pm

I’m getting a lot of generations from gpt-4-1106-preview that are not encoded in UTF-8. Example (in Portuguese):

“tapetes de entrada com padrÃµes geomÃ©tricos”

This corresponds to:

b’tapetes de entrada com padr\xc3\x83\xc2\xb5es geom\xc3\x83\xc2\xa9tricos’

Is anyone else seeing these outputs?
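The byte string above is consistent with double encoding: UTF-8 bytes misread as ISO-8859-1 and then encoded as UTF-8 a second time. The following is a minimal Java sketch that reproduces the same bytes for one word from the example. The class and method names are hypothetical, and this only simulates the suspected corruption; it does not show what the API does internally.

```java
import java.nio.charset.StandardCharsets;

public class DoubleEncodingDemo {
    // Simulates the suspected corruption: the UTF-8 bytes of a clean string
    // are decoded with the wrong charset (ISO-8859-1).
    static String garble(String clean) {
        byte[] utf8 = clean.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    // Formats bytes as lowercase hex so they can be compared with a byte dump.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The clean Portuguese word, with o-tilde written as an escape
        // so the source file encoding does not matter.
        String garbled = garble("padr\u00f5es");
        // Encoding the garbled string as UTF-8 yields c3 83 c2 b5 in the
        // middle, matching the byte dump in the post.
        System.out.println(hex(garbled.getBytes(StandardCharsets.UTF_8)));
    }
}
```

The single character õ (UTF-8 bytes c3 b5) becomes the two characters Ã and µ, which take four bytes once encoded again.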

matthias_kern · November 8, 2023, 1:32pm

I’ve noticed the same. It also occurs with function calling and when producing JSON.

owencmoore · November 27, 2023, 10:55pm

This is a known bug. We have implemented a fix, but it hasn’t rolled out yet. It will be fixed shortly, but until then you will unfortunately need to identify and handle this encoding issue yourself.

chpn.cyril · December 3, 2023, 1:23pm

@owencmoore please see #478500

This is not fixed at all, and the problem is misidentified. The model is not “not returning UTF-8”; rather, it is corrupting the strings, which makes any “fix it ourselves” logic impossible.

louis.antonini · December 3, 2023, 10:36pm

So far, I’ve experienced the following since Dec 2, 2023 (Europe/Stockholm):

For tools function calls with gpt-3.5-turbo, the responses (function calls) contain characters like Ã¤ and Ã¶, which are typical artifacts of an encoding mismatch in which UTF-8 bytes are read as ISO-8859-1 (Latin-1) or Windows-1252. This can be fixed by encoding the string as Latin-1 and decoding the resulting bytes as UTF-8.

For tools function calls with gpt-4-1106-preview, the responses (function calls) omit non-ASCII characters altogether (hard to fix).

So far, I’ve only seen such issues when using functions as tools.
NB: I have stopped using non-tools functions since tools came out.

267626850905yi · December 3, 2023, 11:31pm

Me too.

This seems to be a new bug after the latest release.

louis.antonini · December 4, 2023, 1:21am

You’re using gpt-3.5-turbo-1106. From what I have seen using functions, the problem is that the output is a Latin-1 encoded string claiming to be UTF-8.

What you can try is to convert that string to bytes (the function that does this will typically take an encoding parameter, which you can use to specify that the string is an ISO-8859-1 string). Once you have the bytes, convert them back to a string, but this time as a UTF-8 string. Problem solved.

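Here’s an example in Java:
// Recover the raw bytes by encoding the garbled string as ISO-8859-1
byte[] bytes = inStr.getBytes(StandardCharsets.ISO_8859_1);
// Decode those bytes as UTF-8, which is what they were originally
String outStr = new String(bytes, StandardCharsets.UTF_8);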
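Applied blindly, this conversion damages strings that were already correct. The following is a hedged sketch of a guarded version that only returns the converted text when the recovered bytes are strictly valid UTF-8. The class name `MojibakeRepair` and the method `repair` are hypothetical, and the guard is a heuristic: a correct string that happens to look like mojibake would still be rewritten.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class MojibakeRepair {
    // Returns the repaired string, or the input unchanged when it does not
    // look like UTF-8 bytes that were misread as ISO-8859-1.
    static String repair(String inStr) {
        // Characters above U+00FF cannot result from a Latin-1 misread.
        if (!StandardCharsets.ISO_8859_1.newEncoder().canEncode(inStr)) {
            return inStr;
        }
        byte[] bytes = inStr.getBytes(StandardCharsets.ISO_8859_1);
        // A strict decoder throws on invalid input instead of silently
        // substituting replacement characters.
        CharsetDecoder strict = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            return strict.decode(ByteBuffer.wrap(bytes)).toString();
        } catch (CharacterCodingException e) {
            // Not valid UTF-8, so the string was probably fine to begin with.
            return inStr;
        }
    }

    public static void main(String[] args) {
        // Garbled input written with escapes: A-tilde followed by micro sign.
        System.out.println(repair("padr\u00c3\u00b5es"));
    }
}
```

Plain ASCII text passes through unchanged, and a correctly accented word such as "geométricos" is left alone because a lone byte 0xE9 followed by an ASCII letter is not valid UTF-8.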

–

Regarding gpt-4 (the problem seems different there, but it is also about function calls):

I’m trying to use a library to replace Unicode characters with their Unicode representation (see List of Unicode characters - Wikipedia), to see whether GPT-4 function calls interpret the input better than what I have now.

The result will depend on how such a representation is handled by GPT, but my hope is that the characters are left as-is or correctly interpreted.
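As a rough illustration of that idea, here is a minimal sketch that rewrites every non-ASCII character as a JSON-style backslash-u escape before the text is sent. It assumes that "Unicode representation" means this escape form, and it uses plain Java rather than the library mentioned above. The class and method names are hypothetical. Whether the model leaves such escapes intact is not established in this thread.

```java
public class UnicodeEscaper {
    // Replaces each UTF-16 code unit above 0x7F with a backslash-u escape
    // (four lowercase hex digits). ASCII characters are copied unchanged.
    static String escapeNonAscii(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < 0x80) {
                sb.append(c);
            } else {
                sb.append(String.format("\\u%04x", (int) c));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Input is the Portuguese word from the first post, with o-tilde
        // written as an escape so the source file encoding does not matter.
        System.out.println(escapeNonAscii("padr\u00f5es"));
    }
}
```

Characters outside the Basic Multilingual Plane come out as two escapes (a surrogate pair), which is the form JSON strings expect.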

logankilpatrick · February 17, 2024, 7:05pm

Hey all, as of January 25th, 2024, we have resolved this bug with the latest model releases: New embedding models and API updates

Thank you for being patient with us as we worked to address this at the model level.
