GPT-4o returning malformed Unicode like \u0000e6 instead of æ — encoding bug?

Hi all,

I’m encountering a recurring encoding issue using GPT‑4o via the Chat API (no streaming). The issue appears when the response contains non-ASCII characters like æ, ø, å. Instead of valid Unicode escape sequences (like \u00e6), I often get corrupted ones such as:

"Jeg har modtaget tilstr\u0000e6kkelig information til ..."

Expected:

"Jeg har modtaget tilstrækkelig information til ..."

This occurs intermittently, even with identical prompts and schema input. It appears before any post-processing on my side — directly in chatResponse.Choices[0].Message.Content.

Example from debugger:

ValueKind = String : 
"{"JobAdCreateStatus":{"wascreated":true,"explanation":"Jeg har modtaget tilstr\u0000e6kkelig information til ..."

I’m using:

  • Model: gpt-4o
  • Endpoint: Chat completion
  • Tool use / function calling with JSON schema
  • SDK: Official .NET OpenAI SDK
  • Streaming: disabled
  • Encoding: UTF‑8 throughout

This behavior does not occur consistently, but often enough that it corrupts production content when characters like æ or ø are used.
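
For now I'm considering a post-processing repair before deserializing. A minimal sketch (in Python for brevity, though my pipeline is .NET), assuming the corruption is always a stray "00" injected right after "\u" and that the model never legitimately emits a literal NUL escape:

```python
import json
import re

# Matches the corrupted form "\u0000e6" and captures the two hex digits that
# should have followed "\u00". Assumes a legitimate "\u0000" (NUL) never occurs.
_CORRUPT_ESCAPE = re.compile(r'\\u0000([0-9a-fA-F]{2})')

def repair_escapes(raw: str) -> str:
    """Rewrite '\\u0000e6'-style sequences back to '\\u00e6' before parsing."""
    return _CORRUPT_ESCAPE.sub(r'\\u00\1', raw)

raw = '{"JobAdCreateStatus":{"wascreated":true,"explanation":"tilstr\\u0000e6kkelig"}}'
fixed = json.loads(repair_escapes(raw))
print(fixed["JobAdCreateStatus"]["explanation"])  # -> tilstrækkelig
```

That only papers over the symptom, though, so I'd still like to understand the root cause.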

Has anyone else seen similar corruption from GPT‑4o?

Any insights or official response would be very helpful 🙏

Thanks,
— Thomas


Yes, this has come up in another topic as well.

You’ll have to address it with the model itself: instruct it that it should natively output unescaped UTF-8 text, even inside structured output and code (provided the code’s encoding is Unicode, as in Python 3), and that it should not emit high-ASCII code-page bytes.
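
A minimal sketch of that instruction on Chat Completions (Python; the wording is illustrative and should be adapted to your own system prompt):

```python
from openai import OpenAI

client = OpenAI()

# Illustrative wording only; adapt it to your own system prompt.
system_msg = (
    "Always write text natively as UTF-8. Output characters such as æ, ø and å "
    "directly, never as \\uXXXX escape sequences and never as legacy code-page "
    "bytes, including inside JSON structured output and inside code."
)

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": system_msg},
        # Danish user prompt, matching the OP's use case ("Summarize the job ad briefly.")
        {"role": "user", "content": "Opsummer jobannoncen kort."},
    ],
)
print(resp.choices[0].message.content)
```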

You can also put some example outputs in the system message, in the user’s language if known, or alternatively include some example chat turns (on Chat Completions you can set "name":"example" alongside the role); see the sketch after the logit_bias note below.

You can use logit_bias to demote certain tokens on Chat Completions, in this case, 7570 for “\u”.
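
Putting the example turns and the bias together, roughly like this (Python sketch; the "name" values, the example content, and the -100 bias are all illustrative, and 7570 is the token id claimed above for “\u”):

```python
from openai import OpenAI

client = OpenAI()

messages = [
    {"role": "system",
     "content": "Always write æ, ø and å directly as UTF-8, never as \\uXXXX escapes."},
    # Few-shot example turns; "name" marks them as examples rather than real dialogue.
    {"role": "user", "name": "example_user",
     "content": "Bekræft at du har modtaget oplysningerne."},
    {"role": "assistant", "name": "example_assistant",
     "content": "Jeg har modtaget tilstrækkelig information til at oprette annoncen."},
    # The real request ("Create the job ad from the information above.")
    {"role": "user", "content": "Opret jobannoncen ud fra oplysningerne ovenfor."},
]

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,
    # Demote the token said to map to "\u" (id 7570); -100 effectively bans it.
    logit_bias={"7570": -100},
)
print(resp.choices[0].message.content)
```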


Neither feature (the message “name” field nor logit_bias) is available on the Responses API.

You also have gpt-4o-2024-11-20 to try out.

This ultimately needs OpenAI to acknowledge and fix the models. One shouldn’t have to do extensive fine-tuning just to get the AI to write a world language properly.