I’m getting a lot of generations from gpt-4-1106-preview that are not correctly encoded in UTF-8. Example (in Portuguese):
“tapetes de entrada com padrões geométricos”
This corresponds to:
b’tapetes de entrada com padr\xc3\x83\xc2\xb5es geom\xc3\x83\xc2\xa9tricos’
Is anyone else experiencing these outputs?
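For anyone wanting to check their own output: the bytes above look like text that went through the UTF-8/Latin-1 mix-up twice. Here is a minimal sketch that decodes the first garbled word from the example and undoes one round of damage. The class and method names (`DoubleEncodingDemo`, `undoOnce`) are made up for illustration, and the assumption that the damage is exactly a Latin-1 misread is mine, not something confirmed by OpenAI.

```java
import java.nio.charset.StandardCharsets;

class DoubleEncodingDemo {
    // Undo one round of "UTF-8 bytes misread as ISO-8859-1" damage
    // (assumption: that is what happened to the text).
    static String undoOnce(String garbled) {
        byte[] raw = garbled.getBytes(StandardCharsets.ISO_8859_1);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // First garbled word of the byte string in the example, written out by hand.
        byte[] received = {
            'p', 'a', 'd', 'r',
            (byte) 0xC3, (byte) 0x83, (byte) 0xC2, (byte) 0xB5,
            'e', 's'
        };
        // Decoding as UTF-8 gives the garbled text seen in the response.
        String garbled = new String(received, StandardCharsets.UTF_8);
        // One repair pass should recover the intended Portuguese word.
        String repaired = undoOnce(garbled);
        System.out.println(garbled);
        System.out.println(repaired);
    }
}
```

If the second line printed still looks garbled, the text may have been damaged in some other way and this repair does not apply.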
I’ve noticed the same. It also occurs with function calling and when producing JSON.
This is a known bug. We have implemented a fix, but it hasn’t rolled out yet. It will be fixed shortly, but until then you will unfortunately need to identify and handle this encoding issue yourself.
@owencmoore please see #478500
This is not fixed at all, and the problem is misidentified. The model is not “not returning UTF-8”; rather, it is corrupting the strings, which makes any “fix it ourselves” logic impossible.
So far, I’ve experienced the following since Dec-2 2024 (Europe/Stockholm)
For tools function calls w/ gpt-3.5-turbo, the responses (function calls) garble characters like ä and ö, which is a typical artifact of an encoding mismatch, particularly between UTF-8 and ISO-8859-1 (Latin-1) or Windows-1252. This can be fixed by encoding the string as Latin-1 and decoding the resulting bytes as UTF-8.
For tools function calls w/ gpt-4-1106-preview, the responses (function calls) omit non-ASCII characters entirely (hard to fix).
So far, I’ve only seen these issues when using functions as tools.
NB. I have stopped using non-tools functions since tools came out.
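To show where the gpt-3.5-turbo artifacts would come from, here is a small sketch that simulates the suspected mismatch: UTF-8 bytes decoded as ISO-8859-1. `MojibakeDemo` and `misread` are hypothetical names, and this only reproduces the symptom; it is an assumption that the API does the same thing internally.

```java
import java.nio.charset.StandardCharsets;

class MojibakeDemo {
    // Simulate the suspected fault: UTF-8 bytes decoded as ISO-8859-1.
    static String misread(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // a-umlaut and o-umlaut, written as escapes to avoid source-encoding issues.
        String original = "\u00E4\u00F6";
        String garbled = misread(original);
        // Each original character turns into two characters.
        System.out.println(original.length() + " -> " + garbled.length());
        // Print the code point of every garbled character.
        for (char c : garbled.toCharArray()) {
            System.out.printf("U+%04X%n", (int) c);
        }
    }
}
```

Each two-byte UTF-8 sequence becomes two separate Latin-1 characters, which is why a single accented letter shows up as a pair starting with U+00C3.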
me too
seems a new bug after the latest release
You’re using gpt-3.5-turbo-1106. From what I have seen using functions, the problem is that the output is a Latin-1 encoded string claiming to be UTF-8.
What you can try is to convert that string to bytes (the function for this typically takes an encoding parameter, which you can use to specify that the string is an ISO-8859-1 string). Once you have the bytes, convert them back to a string, but this time as a UTF-8 string. Problem solved.
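Here’s an example in Java:
import java.nio.charset.StandardCharsets;
byte[] bytes = inStr.getBytes(StandardCharsets.ISO_8859_1); // turn each char back into its raw byte
String outStr = new String(bytes, StandardCharsets.UTF_8); // decode those bytes as UTF-8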
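One caveat with the conversion above: applied blindly, it will damage strings that were already fine. Below is a hedged sketch of a guarded version that only repairs a string when the round trip yields valid UTF-8. `SafeRepair` and `repairIfGarbled` are hypothetical names. It assumes the damage is a plain ISO-8859-1 misread, so it will not catch Windows-1252 variants.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class SafeRepair {
    // Returns the repaired string, or the input unchanged if it does not
    // look like UTF-8 that was misread as ISO-8859-1.
    static String repairIfGarbled(String in) {
        // Characters above U+00FF cannot come from a Latin-1 misread.
        if (!StandardCharsets.ISO_8859_1.newEncoder().canEncode(in)) {
            return in;
        }
        byte[] bytes = in.getBytes(StandardCharsets.ISO_8859_1);
        try {
            // A strict decoder throws on invalid UTF-8 instead of
            // silently substituting replacement characters.
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            return in; // not valid UTF-8, so leave the text alone
        }
    }

    public static void main(String[] args) {
        String garbled = "\u00C3\u00A4"; // a-umlaut after a Latin-1 misread
        String clean = "\u00E4";         // a-umlaut that was never damaged
        System.out.println(repairIfGarbled(garbled).equals(clean));
        System.out.println(repairIfGarbled(clean).equals(clean));
    }
}
```

A correctly encoded accented character on its own is not a valid UTF-8 byte sequence when viewed as Latin-1 bytes, so the strict decoder rejects it and the clean string passes through untouched.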
–
Regarding gpt-4 (the problem seems different there, but it also involves function calls):
I’m trying to replace Unicode characters with their escaped code-point representation using a library (reference: List of Unicode characters - Wikipedia) to see if GPT-4 function calls interpret the input better than they do now.
The result will depend on how GPT handles such a representation, but my hope is that the escapes are either left as-is or correctly interpreted.
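A rough sketch of that escaping idea, without any library: every non-ASCII character is replaced by a `\uXXXX` escape before the text is sent. `UnicodeEscaper` and `escapeNonAscii` are hypothetical names, and whether the model leaves such escapes intact is untested here.

```java
class UnicodeEscaper {
    // Replace every non-ASCII UTF-16 code unit with a backslash-u escape
    // followed by four hex digits. ASCII characters pass through unchanged.
    static String escapeNonAscii(String in) {
        StringBuilder out = new StringBuilder(in.length());
        for (int i = 0; i < in.length(); i++) {
            char c = in.charAt(i);
            if (c < 0x80) {
                out.append(c);
            } else {
                out.append(String.format("\\u%04x", (int) c));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // The Portuguese example from the first post, with escapes in the source.
        System.out.println(escapeNonAscii("padr\u00F5es geom\u00E9tricos"));
    }
}
```

Characters outside the Basic Multilingual Plane come out as two escapes (a surrogate pair), which matches how JSON represents them.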
Hey all, as of January 25th 2024, we have resolved this bug with the latest model releases: New embedding models and API updates
Thank you for being patient with us as we worked to address this at the model level.