GPT 4.1 Character Encoding Issues?

@Mohamed_Rebai: No updates that I know of, and I also think you’ll have to wait for a new version of GPT-4.1, if one is ever released at all.

Meanwhile, you’ll find some workaround approaches above - don’t those work for you?

For me, the pure prompting approach works reliably enough (even though I need to repeat the instructions in every prompt that requests structured output), and I don’t need a postprocessing / repair step.

Others use GPT-4.1-nano to fix what GPT-4.1 (without nano) broke… A pure string-replacement table probably won’t work, as the output is not fully deterministic and not unique. (I.e. often just \u0000, i.e. null-codepoint escapes, are returned, which provide no information at all about which character should really have been there.)
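For anyone who wants to try that repair-step route, here is a minimal sketch of what it could look like - the prompt wording, the helper name and the idea of simply passing the broken JSON back through gpt-4.1-nano are my own assumptions, not an official recipe:

```python
from openai import OpenAI

client = OpenAI()

def repair_broken_json(broken_json: str) -> str:
    """Hypothetical repair step: ask a small model to restore corrupted escapes."""
    response = client.chat.completions.create(
        model="gpt-4.1-nano",
        messages=[
            {
                "role": "system",
                "content": (
                    "The user message is JSON containing corrupted unicode escape "
                    "sequences such as \\u0000e4. Return the same JSON with every "
                    "text field restored to proper characters. Output JSON only."
                ),
            },
            {"role": "user", "content": broken_json},
        ],
    )
    return response.choices[0].message.content
```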

Btw.:

The escape returned for “Ü” in the provided example was actually \u000c, not \u000cb - the “b” came from the verbatim “berwachung”.

\u000c is just a “Form Feed” control character…

Though ä was indeed returned as \u0000e4, which is not a valid unicode escape at all - it’s the null codepoint \u0000, followed by the verbatim letters “e4”… So really pretty random corruption.
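To make that concrete, here is what a JSON parser actually produces from those two escapes (just an illustration of the corruption described above):

```python
import json

# "\u000c" is a single form-feed control character; the "b" after it is verbatim text.
print(repr(json.loads('"\\u000cberwachung"')))  # '\x0cberwachung'

# "\u0000e4" decodes to a null codepoint followed by the literal letters "e4".
print(repr(json.loads('"\\u0000e4"')))          # '\x00e4'
```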


The errors we encountered are not that random (although I’m not so sure anymore after reading this thread); what we’ve seen so far is always a control character (like \u0000 or \u0001) followed by the actual hex code of the intended character (as normal characters).

I’m trying to identify such sequences and correctly interpret them.
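For what it’s worth, a minimal sketch of that idea might look like this - it assumes the pattern is always one control character followed by exactly two hex digits, which the examples elsewhere in this thread show does not hold in general:

```python
import re

# Hypothetical heuristic: a control character (excluding tab/newline) followed by
# two hex digits, reinterpreted as the code point those digits spell out
# (e.g. '\x00' + 'e4' -> 'ä'). It can still mangle legitimate text.
_CTRL_PLUS_HEX = re.compile(r"[\x00-\x08\x0b-\x1f]([0-9a-fA-F]{2})")

def guess_repair(decoded_text: str) -> str:
    return _CTRL_PLUS_HEX.sub(lambda m: chr(int(m.group(1), 16)), decoded_text)

print(guess_repair("K\x00fchlung"))  # 'Kühlung' - if the guess happens to be right
```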

That’s not always the case, as you can see from the examples in this thread, so I would not rely on it if I were you. 🙂

Hello, everyone.
I’m finding the same issue when passing the visible text of a Spanish website. It contains unicode characters (such as áéíóú and ñ, but also ™ and €) that are well-formed and correctly parsed. Whenever I use the 4o model, in any of its versions, it works perfectly, but 4.1 (just sometimes) introduces these \u- and \x-like sequences all over the response.

Any advance on this will be highly appreciated.

Thanks for flagging this, Luis! Would you mind sharing this, and any other information you believe captures the issue, with support@openai.com, please?

Here’s the information:

OpenAI must train new AI models - ones built with proper data sanitization techniques applied to the training corpus, the distillation data, and the domain-specific post-training data.

The AI is not just producing invalid unicode or invalid bytes; it is trying to represent language as escaped bytes.

This is exacerbated in a structured output setting, as the AI, trained more heavily on writing code, may be predisposed to escape text in a “code-like” way that is inappropriate for JSON.

This can only be attributed to training.

You need to task a data scientist with finding out where neuronal knowledge has incorporated such failings as “\u0000e5”, which is invalid unicode even if it were decoded, and with identifying the ingestion source. An AI writing ‘0xe4’ → ‘ä’ can only come about from the incorporation of single-byte text corpora that use language-specific code pages which were never properly translated to native unicode code points.
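As a quick illustration of that code-page point (a sketch for clarity, not evidence about the actual training data): the byte 0xE4 only means “ä” under a single-byte encoding such as Latin-1 or Windows-1252, never in UTF-8:

```python
# 0xE4 maps to 'ä' only in single-byte code pages such as Latin-1 / Windows-1252.
print(bytes([0xE4]).decode("latin-1"))  # ä
print(bytes([0xE4]).decode("cp1252"))   # ä

# In UTF-8, the same character is a two-byte sequence; a lone 0xE4 byte is invalid there.
print("ä".encode("utf-8"))              # b'\xc3\xa4'
```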

Then offer new models, just as gpt-4-turbo-1106 had to be replaced because of similar encoding failings.

Still having this issue to this day, with both GPT-4.1 and 4o. I’m doing a web search with structured output, and the results contain messed-up UTF-8 encodings. A “ö” is represented as \u000f6 (instead of \u00f6), and it is impossible to come up with a solution that reliably recovers the intended character in all cases. Asking the LLM nicely to be more careful with unicode characters seems to make a small difference, but not a good enough one.
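For that particular \u000f6-style pattern, one partial mitigation is to patch the raw JSON text before parsing it - a sketch only, since it assumes the intended escape was always \u00xx with a dropped digit, which this thread shows is not guaranteed:

```python
import re

# Hypothetical pre-parse fix: rewrite five-hex-digit escapes like \u000f6 as \u00f6.
# It assumes the intended code point is U+00xx, and it can still misfire on a
# legitimate \u000X escape that happens to be followed by a hex-digit letter.
_FIVE_DIGIT_ESCAPE = re.compile(r"\\u000([0-9a-fA-F]{2})(?![0-9a-fA-F])")

def patch_raw_json(raw: str) -> str:
    return _FIVE_DIGIT_ESCAPE.sub(r"\\u00\1", raw)

print(patch_raw_json('{"word": "Sch\\u000f6n"}'))  # {"word": "Sch\u00f6n"}
```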
