I see your edit. Theory: training on web scrapes rather than properly encoded literature has made this a plausible thing for the AI to write. You can resend the same request with a low top_p (for example 0.1) API parameter and see whether this was just random or instead quite likely given the input you sent.
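A minimal sketch of that check, assuming the openai Python SDK and a made-up prompt (substitute the model and messages that produced the entities for you):

import openai

client = openai.OpenAI()

# Resend the same input with near-greedy sampling; if the HTML entities
# still appear, they are a high-probability output rather than random noise.
response = client.chat.completions.create(
    model="gpt-4o",  # assumption: whichever model you were using
    messages=[{"role": "user", "content": "Repeat back: jordtillæggende"}],
    top_p=0.1,
)
print(response.choices[0].message.content)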
Let’s make an AI fix its mistakes:
It appears that the issue with the character encoding involves HTML character entities being used instead of the characters themselves. In your example, &#230; represents the character “æ” in HTML. This can happen when text is improperly encoded into HTML entities instead of being transmitted or stored as plain UTF-8 text.
To address this issue, you can use Python’s html module to unescape these HTML entities back into their correct UTF-8 characters. Here’s how you could do it:
import html
def decode_html_entities(text):
    return html.unescape(text)
# Example usage
encoded_string = "jordtill&#230;ggende"
decoded_string = decode_html_entities(encoded_string)
print(decoded_string)
This code uses the html.unescape() function, which is designed to convert HTML entities back into their corresponding characters. It will render “jordtillæggende” correctly from “jordtill&#230;ggende”.
A very similar problem was seen in the poorly trained gpt-4-1106, but only when it called functions. It may be JSON mode interfering with the production of strings that would be invalid if left unescaped or not encoded as UTF-8 bytes. Or perhaps the web scrapings in the training data never carried a header indicating UTF-8.
“&#” is token number 23974 in the o200k_base encoding used by GPT-4o, and “&” alone is token 5 if the AI tries to write this a different way. This HTML-entity prefix could be discouraged with a logit_bias, to see what arises when the AI can no longer write HTML entities from your input.
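A sketch of that experiment, again assuming the openai Python SDK; the token IDs are the ones quoted above, so verify them against the tokenizer for your model before relying on them:

import openai

client = openai.OpenAI()

# Push the "&#" and "&" tokens far down so the model cannot begin an HTML entity.
# Note that this also blocks legitimate ampersands for the duration of the test.
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Repeat back: jordtillæggende"}],
    logit_bias={"23974": -100, "5": -100},
)
print(response.choices[0].message.content)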
Or simply instruct the AI that characters outside plain ASCII must be escaped in JSON output, such as \u00e6 for this character.
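For reference, that is the same escaping Python's json module performs by default (ensure_ascii=True), which is the output format you would be asking the model to imitate:

import json

# ensure_ascii=True (the default) escapes every non-ASCII character as \uXXXX
print(json.dumps({"word": "jordtillæggende"}))
# {"word": "jordtill\u00e6ggende"}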