Wrong encoding for gpt-4o during API Chat completion

What ‘gpt-4o’ returns (wrong):
jordtill&#230;ggende

What ‘gpt-4-turbo’ returns (correct):
jordtillæggende

We had to go back to gpt-4-turbo because of this issue. Has anyone else hit this encoding problem with gpt-4o (omni) ChatCompletion calls?

For more info:

  • Using function calling
  • Using JSON Mode (Response format is json_object)
  • In our case it’s the Danish letters æ, ø, å (capitals: Æ, Ø, Å), and possibly other characters too; a minimal repro sketch follows below.
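Something like this minimal sketch shows the setup (assuming the official openai Python client; the model settings match the bullets above, but the prompt and JSON key are illustrative only):

from openai import OpenAI  # assumes the official openai Python package

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# JSON mode request containing Danish text; prompt and key name are made up.
response = client.chat.completions.create(
    model="gpt-4o",
    response_format={"type": "json_object"},
    messages=[
        {"role": "system", "content": "Reply as JSON with a single key 'word'."},
        {"role": "user", "content": "Return the Danish word 'jordtillæggende' unchanged."},
    ],
)
print(response.choices[0].message.content)
# bad output:      {"word": "jordtill&#230;ggende"}
# expected output: {"word": "jordtillæggende"}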


Perhaps if you use the forum’s preformatted text option </>, we can see a better representation of what is different about them…


I see your edit. Theory: AI training on web scrapes instead of properly encoded literature has made this a plausible thing for the AI to write. You can set a low top_p (e.g. 0.1) in the API call to see whether this output was just random or instead quite likely given the input you sent.
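For example, a quick test along these lines, assuming the official openai Python client (the prompt is a placeholder):

from openai import OpenAI  # assumes the official openai Python package

client = OpenAI()

# Rerun the same request near-greedily: if gpt-4o still emits &#230;,
# the entity is its most likely output, not a sampling fluke.
response = client.chat.completions.create(
    model="gpt-4o",
    top_p=0.1,  # restrict sampling to the top ~10% of probability mass
    messages=[
        {"role": "user", "content": "Write the Danish word 'jordtillæggende' verbatim."},  # placeholder prompt
    ],
)
print(response.choices[0].message.content)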

Let’s make an AI fix its mistakes:

It appears that the issue with the character encoding involves HTML character entities being used instead of directly using the characters. In your example, &#230; represents the character “æ” in HTML. This can happen when text is improperly encoded into HTML entities instead of being transmitted or stored as plain UTF-8 text.

To address this issue, you can use Python’s html module to unescape these HTML entities back into their correct UTF-8 characters. Here’s how you could do it:

import html

def decode_html_entities(text):
    # html.unescape converts entities like &#230; back to the characters they name
    return html.unescape(text)

# Example usage
encoded_string = "jordtill&#230;ggende"
decoded_string = decode_html_entities(encoded_string)
print(decoded_string)  # jordtillæggende

This code uses the html.unescape() function, which is designed to convert HTML entities back into the corresponding characters. This will render “jordtillæggende” correctly from “jordtill&#230;ggende”.
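Since you’re in JSON mode, you’d typically want to unescape every string in the parsed payload, not just one field. A minimal sketch using only the standard library:

import html
import json

def unescape_json_strings(value):
    """Recursively html-unescape every string inside a parsed JSON value."""
    if isinstance(value, str):
        return html.unescape(value)
    if isinstance(value, list):
        return [unescape_json_strings(v) for v in value]
    if isinstance(value, dict):
        return {k: unescape_json_strings(v) for k, v in value.items()}
    return value

payload = json.loads('{"word": "jordtill&#230;ggende"}')
print(unescape_json_strings(payload))  # {'word': 'jordtillæggende'}

One caveat: html.unescape also rewrites any legitimate literal entities such as &amp;, so only apply it when the payload shouldn’t contain them.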


A very similar problem was seen with bad training in gpt-4-1106, but only when it called functions. It may be that JSON mode interferes with producing strings that would be invalid if left unescaped or not encoded as UTF-8 bytes. Or the web scrapings in the training data may simply not have carried headers declaring UTF-8.

“&#” is token number 23974 in the o200k_base encoding used by GPT-4o, and “&” alone is token 5 if the AI tries to write the entity a different way. This HTML-entity prefix could be discouraged with a logit_bias, to see what arises when the AI can no longer write HTML entities for your input.
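A sketch of that experiment, assuming the official openai Python client and using tiktoken to double-check the token IDs first (the prompt is a placeholder):

import tiktoken
from openai import OpenAI  # assumes the official openai Python package

# Verify the token IDs quoted above against the o200k_base encoding.
enc = tiktoken.get_encoding("o200k_base")
print(enc.encode("&#"))  # [23974] per the post above
print(enc.encode("&"))   # [5] per the post above

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    # -100 effectively bans a token; note that biasing "&" also blocks
    # every legitimate ampersand the model might want to write.
    logit_bias={"23974": -100, "5": -100},
    messages=[
        {"role": "user", "content": "Write the Danish word 'jordtillæggende'."},  # placeholder prompt
    ],
)
print(response.choices[0].message.content)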

Or simply instruct the AI that all non-ASCII characters must be escaped in JSON output as \uXXXX sequences, such as \u00e6 for this character.

>>> print('\u00e6')
æ
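Incidentally, that escaped form is exactly what Python’s json.dumps produces by default, which is handy for checking what correctly escaped output should look like:

import json

# json.dumps escapes all non-ASCII characters by default (ensure_ascii=True),
# which is exactly the output format that instruction asks the model for.
print(json.dumps({"word": "jordtillæggende"}))
# {"word": "jordtill\u00e6ggende"}

# And json.loads turns the escape back into the character on the way in.
print(json.loads('{"word": "jordtill\\u00e6ggende"}'))
# {'word': 'jordtillæggende'}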