API Response encoding Bug | UTF-8/UTF-16

eggers.mats · February 11, 2024, 11:19am

There are already threads complaining that the API response doesn’t display non-English characters correctly:

The new version “gpt-3.5-turbo-0125” is supposed to fix this bug, as you can read in this article:

But even though I use the new version “0125”, the problem is still the same. Does anyone have a solution for that?

Thanks for your responses!

_j · February 11, 2024, 12:05pm

I can’t fix dummkopf functions, but I can offer a solution.

Bad:

"arguments": { "name": "DÃ¶ner", "instructions": [ "Schneide das Fleisch in dÃ¼nne Streifen.", "Mariniere das Fleisch mit GewÃ¼rzen und Joghurt.", "Brate das Fleisch in einer Pfanne oder auf dem Grill.", "Schneide das GemÃ¼se und bereite den Salat vor.", "FÃ¼lle das Fleisch, GemÃ¼se und Salat in das Fladenbrot.", "FÃ¼ge die SoÃ e hinzu und rolle das Fladenbrot

Füge die Soße hinzu und rolle das Fladenbrot

Fixed by code:


import re

def fix_encoding(s):
    # Map mis-decoded (mojibake) character sequences to the intended characters
    correction_map = {
        'Ã¶': 'ö',        'Ã¼': 'ü',
        'Ã¤': 'ä',        'ÃŸ': 'ß',
        'Ã ': 'ß',        'ÃƒÂ¼': 'ü',
        'ÃƒÂ¶': 'ö',        'ÃƒÂ¤': 'ä',
        'ÃƒÂŸ': 'ß',        'ÃƒÂ ': 'À',
        'Ã\x9f': 'ß', # where \x9f represents the invisible character
    }
    
    # Build one alternation pattern from the keys, longest first so that a
    # longer sequence is never shadowed by a shorter key that prefixes it
    keys = sorted(correction_map, key=len, reverse=True)
    regex = re.compile("|".join(map(re.escape, keys)))
    # Replace each match with its corrected character from the dictionary
    return regex.sub(lambda mo: correction_map[mo.group(0)], s)

incorrect_strings = [
    '"arguments": { "name": "DÃ¶ner", "instructions": [ "Schneide das Fleisch in dÃ¼nne Streifen.", "Mariniere das Fleisch mit GewÃ¼rzen und Joghurt.", "Brate das Fleisch in einer Pfanne oder auf dem Grill.", "Schneide das GemÃ¼se und bereite den Salat vor.", "FÃ¼lle das Fleisch, GemÃ¼se und Salat in das Fladenbrot.", "FÃ¼ge die SoÃ e hinzu und rolle das Fladenbrot"'
]
corrected_strings = [fix_encoding(s) for s in incorrect_strings]
print(corrected_strings)

output:

['"arguments": { "name": "Döner", "instructions": [ "Schneide das Fleisch in dünne Streifen.", "Mariniere das Fleisch mit Gewürzen und Joghurt.", "Brate das Fleisch in einer Pfanne oder auf dem Grill.", "Schneide das Gemüse und bereite den Salat vor.", "Fülle das Fleisch, Gemüse und Salat in das Fladenbrot.", "Füge die So?e hinzu und rolle das Fladenbrot"']

To make this reliable, you’d have to get the actual bytes the API returns for the “ß” in “Füge die Soße hinzu und rolle das Fladenbrot”; the mapping in the code is only a guess at them.

The other long thread may also contain a more generic solution that still applies or can be adapted.
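One generic approach, sketched here as an assumption rather than as the solution from that thread: the garbled text looks like UTF-8 bytes that were decoded as Windows-1252 or Latin-1, so re-encoding with that codec and decoding as UTF-8 reverses the damage without a lookup table. The function name `undo_mojibake` is made up for illustration. Note that it repairs a string only if the whole string round-trips; text that mixes correct and garbled characters is returned unchanged.

```python
def undo_mojibake(text: str) -> str:
    """Reverse UTF-8 text that was mis-decoded as Windows-1252 or Latin-1.

    Assumption: the entire string was mis-decoded the same way. If neither
    codec round-trips cleanly, the input is returned untouched.
    """
    for codec in ("cp1252", "latin-1"):
        try:
            # Recover the original bytes, then decode them correctly.
            return text.encode(codec).decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue
    return text


print(undo_mojibake("DÃ¶ner"))                # Döner
print(undo_mojibake("FÃ¼ge die SoÃŸe hinzu"))  # Füge die Soße hinzu
print(undo_mojibake("Döner"))                  # Döner (already correct, left as-is)
```

If a byte was lost in transit, as with the space in “SoÃ e” above, the original character cannot be recovered this way and the string comes back unchanged.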

eggers.mats · February 11, 2024, 6:01pm

Oh man, thank you so much! I’ll try to figure out how to use this in Dart, since I am using Dart/Flutter, but that already helped a lot. Thank you!!

mauro.cherchi.2 · February 15, 2024, 5:54pm

Thank you!
Here is the list with some more characters, in case anyone needs it:

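corrections = {
    'Ã\xa0': 'à', 'Ã¨': 'è', 'Ã©': 'é', 'Ã¬': 'ì', 'Ã²': 'ò', 'Ã³': 'ó', 'Ã¹': 'ù',
    'Ã¤': 'ä', 'Ã¶': 'ö', 'Ã¼': 'ü', 'ÃŸ': 'ß',
    # 'í' is the bytes C3 AD; \xad is an invisible soft hyphen, so spell it out.
    # A bare 'Ã' key would also match the first character of every other key.
    'Ã¡': 'á', 'Ã\xad': 'í', 'Ã±': 'ñ', 'Ãº': 'ú',
    'Ã¢': 'â', 'Ãª': 'ê', 'Ã«': 'ë', 'Ã®': 'î', 'Ã¯': 'ï', 'Ã´': 'ô', 'Ã»': 'û', 'Ã§': 'ç'
}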

csiebler · April 8, 2024, 11:39am

@eggers.mats Have you been using JSON mode? I’m experiencing the same with 0125; it looks like it has only been fixed for “normal” mode.

While the fix above works, it makes it a bit cumbersome when using streaming mode.
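The awkward part with streaming is that a garbled pair such as “Ã¶” can be split across two chunks. A minimal sketch of one way around that, assuming the stream is available as an iterable of plain text chunks (the real response objects look different) and using the hypothetical name `fix_stream`: hold back a trailing character that could start a two-character sequence until the next chunk arrives. It only handles two-character sequences, which covers the umlauts and “ß” but not the double-encoded four-character variants.

```python
import re
from typing import Iterable, Iterator

# A UTF-8 lead byte of a two-byte sequence (0xC2-0xDF) shows up as one of
# the characters "Â" through "ß" when misread as Windows-1252/Latin-1.
_LEAD = "\u00c2-\u00df"
# A lead character followed by any character that is not itself a lead.
_PAIR = re.compile(f"[{_LEAD}][^{_LEAD}]")


def _repair_pair(match: re.Match) -> str:
    pair = match.group(0)
    for codec in ("cp1252", "latin-1"):
        try:
            return pair.encode(codec).decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue
    return pair  # not a garbled pair (e.g. a genuine "ße"), keep as-is


def fix_stream(chunks: Iterable[str]) -> Iterator[str]:
    pending = ""
    for chunk in chunks:
        text = pending + chunk
        pending = ""
        # Hold back a trailing lead character; its partner may arrive next.
        if text and "\u00c2" <= text[-1] <= "\u00df":
            text, pending = text[:-1], text[-1]
        if text:
            yield _PAIR.sub(_repair_pair, text)
    if pending:
        yield pending  # stream ended; nothing left to pair it with


print("".join(fix_stream(["DÃ", "¶ner mit SoÃ", "Ÿe"])))  # Döner mit Soße
```

The trade-off is that a genuine trailing “ß” or similar character is delayed by one chunk before it is emitted.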

csiebler · April 10, 2024, 1:01pm

Just checked with 0409 and it seems that JSON mode works correctly again, even with German umlauts.
