Wrong encoding for gpt-4o during API Chat completion

What ‘gpt-4o’ returns (wrong):
jordtill&#230;ggende

What ‘gpt-4-turbo’ returns (correct):
jordtillæggende

We had to go back to gpt-4-turbo because of this issue. Has anyone else hit this encoding problem with gpt-4o (omni) ChatCompletion calls?

For more info:

  • Using function calling
  • Using JSON Mode (Response format is json_object)
  • In our case it’s the Danish letters æ, ø, å (capitals: Æ, Ø, Å), and possibly other characters too; a minimal repro sketch follows below.
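Something like this minimal sketch shows the setup (assuming the official openai Python client; the model settings match the bullets above, but the prompt and JSON key are illustrative only):

from openai import OpenAI  # assumes the official openai Python package

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# JSON mode request containing Danish text; prompt and key name are made up.
response = client.chat.completions.create(
    model="gpt-4o",
    response_format={"type": "json_object"},
    messages=[
        {"role": "system", "content": "Reply as JSON with a single key 'word'."},
        {"role": "user", "content": "Return the Danish word 'jordtillæggende' unchanged."},
    ],
)
print(response.choices[0].message.content)
# bad output:      {"word": "jordtill&#230;ggende"}
# expected output: {"word": "jordtillæggende"}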


Perhaps if you use the forum’s preformatted text option </>, we can see a better representation of what is different about them…


I see your edit. Theory: AI training on web scrapes instead of properly encoded literature has made this a plausible thing for the AI to write. You can set a low top_p (e.g. 0.1) in the API call to see whether this output was just random or instead quite likely given the input you sent.
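For example, a quick test along these lines, assuming the official openai Python client (the prompt is a placeholder):

from openai import OpenAI  # assumes the official openai Python package

client = OpenAI()

# Rerun the same request near-greedily: if gpt-4o still emits &#230;,
# the entity is its most likely output, not a sampling fluke.
response = client.chat.completions.create(
    model="gpt-4o",
    top_p=0.1,  # restrict sampling to the top ~10% of probability mass
    messages=[
        {"role": "user", "content": "Write the Danish word 'jordtillæggende' verbatim."},  # placeholder prompt
    ],
)
print(response.choices[0].message.content)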

Let’s make an AI fix its mistakes:

It appears that the issue with the character encoding involves HTML character entities being used instead of directly using the characters. In your example, &#230; represents the character “æ” in HTML. This can happen when text is improperly encoded into HTML entities instead of being transmitted or stored as plain UTF-8 text.

To address this issue, you can use Python’s html module to unescape these HTML entities back into their correct UTF-8 characters. Here’s how you could do it:

import html

def decode_html_entities(text):
    # html.unescape converts entities like &#230; back to the characters they name
    return html.unescape(text)

# Example usage
encoded_string = "jordtill&#230;ggende"
decoded_string = decode_html_entities(encoded_string)
print(decoded_string)  # jordtillæggende

This code uses the html.unescape() function, which is designed to convert HTML entities back into the corresponding characters. This will render “jordtillæggende” correctly from “jordtill&#230;ggende”.
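Since you’re in JSON mode, you’d typically want to unescape every string in the parsed payload, not just one field. A minimal sketch using only the standard library:

import html
import json

def unescape_json_strings(value):
    """Recursively html-unescape every string inside a parsed JSON value."""
    if isinstance(value, str):
        return html.unescape(value)
    if isinstance(value, list):
        return [unescape_json_strings(v) for v in value]
    if isinstance(value, dict):
        return {k: unescape_json_strings(v) for k, v in value.items()}
    return value

payload = json.loads('{"word": "jordtill&#230;ggende"}')
print(unescape_json_strings(payload))  # {'word': 'jordtillæggende'}

One caveat: html.unescape also rewrites any legitimate literal entities such as &amp;, so only apply it when the payload shouldn’t contain them.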


A very similar problem was seen with bad training in gpt-4-1106, but only when it called functions. It may be that JSON mode interferes with producing strings that would be invalid if left unescaped or not encoded as UTF-8 bytes. Or the web scrapings in the training data may simply not have carried headers declaring UTF-8.

“&#” is token number 23974 in the o200k_base encoding used by GPT-4o, and “&” alone is token 5 if the AI tries to write the entity a different way. This HTML-entity prefix could be discouraged with a logit_bias, to see what arises when the AI can no longer write HTML entities for your input.
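A sketch of that experiment, assuming the official openai Python client and using tiktoken to double-check the token IDs first (the prompt is a placeholder):

import tiktoken
from openai import OpenAI  # assumes the official openai Python package

# Verify the token IDs quoted above against the o200k_base encoding.
enc = tiktoken.get_encoding("o200k_base")
print(enc.encode("&#"))  # [23974] per the post above
print(enc.encode("&"))   # [5] per the post above

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    # -100 effectively bans a token; note that biasing "&" also blocks
    # every legitimate ampersand the model might want to write.
    logit_bias={"23974": -100, "5": -100},
    messages=[
        {"role": "user", "content": "Write the Danish word 'jordtillæggende'."},  # placeholder prompt
    ],
)
print(response.choices[0].message.content)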

Or simply instruct the AI that all non-ASCII characters must be escaped in JSON output as \uXXXX sequences, such as \u00e6 for this character.

>>> print('\u00e6')
æ
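Incidentally, that escaped form is exactly what Python’s json.dumps produces by default, which is handy for checking what correctly escaped output should look like:

import json

# json.dumps escapes all non-ASCII characters by default (ensure_ascii=True),
# which is exactly the output format that instruction asks the model for.
print(json.dumps({"word": "jordtillæggende"}))
# {"word": "jordtill\u00e6ggende"}

# And json.loads turns the escape back into the character on the way in.
print(json.loads('{"word": "jordtill\\u00e6ggende"}'))
# {'word': 'jordtillæggende'}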