GPT 4.1 Character Encoding Issues?

@Mohamed_Rebai: No updates that I know of, and I also think you’ll have to wait for a new version of GPT-4.1, if one is ever released at all.

Meanwhile, you’ll find some workaround approaches above - don’t those work for you?

For me, the pure prompting approach works reliably enough (even though I need to repeat the instructions in every prompt that requests structured output), and I don’t need a postprocessing / repair step.

Others use GPT-4.1-nano to fix what GPT-4.1 (without nano) broke… A pure string-replacement table probably won’t work, as the output is not fully deterministic and not unique. (I.e. often just \u0000, i.e. null-codepoint escapes, are returned, which provide no information at all about which character should really have been there.)
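For anyone who wants to try that repair-step route, here is a minimal sketch of what it could look like - the prompt wording, the helper name and the idea of simply passing the broken JSON back through gpt-4.1-nano are my own assumptions, not an official recipe:

```python
from openai import OpenAI

client = OpenAI()

def repair_broken_json(broken_json: str) -> str:
    """Hypothetical repair step: ask a small model to restore corrupted escapes."""
    response = client.chat.completions.create(
        model="gpt-4.1-nano",
        messages=[
            {
                "role": "system",
                "content": (
                    "The user message is JSON containing corrupted unicode escape "
                    "sequences such as \\u0000e4. Return the same JSON with every "
                    "text field restored to proper characters. Output JSON only."
                ),
            },
            {"role": "user", "content": broken_json},
        ],
    )
    return response.choices[0].message.content
```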

Btw.:

The escape returned for “Ü” in the provided example was actually \u000c, not \u000cb - the “b” came from the verbatim “berwachung”.

\u000c is just a “Form Feed” control character…

Though ä was indeed returned as \u0000e4, which is not a valid unicode escape at all - it’s the null codepoint \u0000, followed by the verbatim letters “e4”… So really pretty random corruption.
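To make that concrete, here is what a JSON parser actually produces from those two escapes (just an illustration of the corruption described above):

```python
import json

# "\u000c" is a single form-feed control character; the "b" after it is verbatim text.
print(repr(json.loads('"\\u000cberwachung"')))  # '\x0cberwachung'

# "\u0000e4" decodes to a null codepoint followed by the literal letters "e4".
print(repr(json.loads('"\\u0000e4"')))          # '\x00e4'
```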


The errors we encountered are not that random (although I’m not so sure anymore after reading this thread); what we’ve seen so far is always a control character (like \u0000 or \u0001) followed by the actual hex code of the intended character (as normal characters).

I’m trying to identify such sequences and correctly interpret them.
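For what it’s worth, a minimal sketch of that idea might look like this - it assumes the pattern is always one control character followed by exactly two hex digits, which the examples elsewhere in this thread show does not hold in general:

```python
import re

# Hypothetical heuristic: a control character (excluding tab/newline) followed by
# two hex digits, reinterpreted as the code point those digits spell out
# (e.g. '\x00' + 'e4' -> 'ä'). It can still mangle legitimate text.
_CTRL_PLUS_HEX = re.compile(r"[\x00-\x08\x0b-\x1f]([0-9a-fA-F]{2})")

def guess_repair(decoded_text: str) -> str:
    return _CTRL_PLUS_HEX.sub(lambda m: chr(int(m.group(1), 16)), decoded_text)

print(guess_repair("K\x00fchlung"))  # 'Kühlung' - if the guess happens to be right
```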

That’s not always the case, as you can see from the examples in this thread, so I would not rely on it if I were you. 🙂

Hello, everyone.
I’m finding the same issue when passing the visible text of a Spanish website. It contains unicode characters (such as áéíóú and ñ, but also ™ and €) that are well-formed and correctly parsed. Whenever I use the 4o model, in any of its versions, it works perfectly, but 4.1 (just sometimes) introduces these \u- and \x-like sequences all over the response.

Any advance on this will be highly appreciated.

Thanks for flagging this, Luis! Would you mind sharing this, and any other information you believe captures the issue, with support@openai.com, please?

Here’s the information:

OpenAI must train new AI models - ones built with proper data sanitization techniques applied to the training corpus, the distillation data, and the domain-specific post-training data.

The AI is not just producing invalid unicode or invalid bytes; it is trying to represent language as escaped bytes.

This is exacerbated in a structured output setting, as the AI, trained more heavily on writing code, may be predisposed to escape text in a “code-like” way that is inappropriate for JSON.

This can only be attributed to training.

You need to task a data scientist with finding out where neuronal knowledge has incorporated such failings as “\u0000e5”, which is invalid unicode even if it were decoded, and with identifying the ingestion source. An AI writing ‘0xe4’ → ‘ä’ can only come about from the incorporation of single-byte text corpora that use language-specific code pages which were never properly translated to native unicode code points.
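As a quick illustration of that code-page point (a sketch for clarity, not evidence about the actual training data): the byte 0xE4 only means “ä” under a single-byte encoding such as Latin-1 or Windows-1252, never in UTF-8:

```python
# 0xE4 maps to 'ä' only in single-byte code pages such as Latin-1 / Windows-1252.
print(bytes([0xE4]).decode("latin-1"))  # ä
print(bytes([0xE4]).decode("cp1252"))   # ä

# In UTF-8, the same character is a two-byte sequence; a lone 0xE4 byte is invalid there.
print("ä".encode("utf-8"))              # b'\xc3\xa4'
```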

Then offer new models, just as gpt-4-turbo-1106 had to be replaced because of similar encoding failings.

Still having this issue to this day, with both GPT-4.1 and 4o. I’m doing a web search with structured output, and the results contain messed-up UTF-8 encodings. A “ö” is represented as \u000f6 (instead of \u00f6), and it is impossible to come up with a solution that reliably recovers the intended character in all cases. Asking the LLM nicely to be more careful with unicode characters seems to make a small difference, but not a good enough one.
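For that particular \u000f6-style pattern, one partial mitigation is to patch the raw JSON text before parsing it - a sketch only, since it assumes the intended escape was always \u00xx with a dropped digit, which this thread shows is not guaranteed:

```python
import re

# Hypothetical pre-parse fix: rewrite five-hex-digit escapes like \u000f6 as \u00f6.
# It assumes the intended code point is U+00xx, and it can still misfire on a
# legitimate \u000X escape that happens to be followed by a hex-digit letter.
_FIVE_DIGIT_ESCAPE = re.compile(r"\\u000([0-9a-fA-F]{2})(?![0-9a-fA-F])")

def patch_raw_json(raw: str) -> str:
    return _FIVE_DIGIT_ESCAPE.sub(r"\\u00\1", raw)

print(patch_raw_json('{"word": "Sch\\u000f6n"}'))  # {"word": "Sch\u00f6n"}
```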
