GPT 4.1 Character Encoding Issues?

Hi all,

I assume the problem is on my end somehow, but I currently don’t understand why / where:

For generating business document drafts, I've been using the OpenAI Chat Completions API with structured output (via the Python API library and Pydantic) successfully for months with two different versions of the GPT 4o model.
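
For reference, the call is roughly the standard structured-output pattern - a minimal sketch with placeholder model / schema / prompt contents (not my real ones):

from openai import OpenAI
from pydantic import BaseModel

class DocumentDraft(BaseModel):   # placeholder schema, not my real fields
    title: str
    body: str

client = OpenAI()

completion = client.beta.chat.completions.parse(
    model="gpt-4.1",              # previously a gpt-4o snapshot
    messages=[
        {"role": "system", "content": "..."},   # drafting instructions
        {"role": "user", "content": "..."},     # source data
    ],
    response_format=DocumentDraft,
)
draft = completion.choices[0].message.parsed    # a DocumentDraft instance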

I experimentally switched to GPT 4.1 (the full model) after it was announced. The actual content of the generated documents seems noticeably better than before, possibly due to the better instruction-following capabilities and a better understanding of what NOT to write.

However, something I never experienced with GPT 4o: GPT 4.1 frequently messes up the output encoding, returning garbled characters instead of properly encoded UTF-8 whenever non-ASCII characters occur in the text. It does not happen every time (i.e. not for all generated documents), but much too often, and I wasn't able to solve it with specific prompting (telling it to pay attention to encoding).

It mainly happens in a step where I feed a generated first document draft back for an additional review round at a slightly lower temperature (0.8), to make the model cross-check that it properly incorporated the instructions into the generated document. While the version I got from the previous step still seems to have proper special characters, the reviewed version I get back after this step has garbled ones.

So I suspect that I'm possibly messing up the information I provide for review, but I'm really just taking the text from the parsed JSON and feeding it back together with a new user message. It's also exactly what I did with GPT 4o all the time, and it never caused any issues until I switched to 4.1…
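
Continuing the sketch above, the review round looks roughly like this (again placeholder contents; the exact message layout is illustrative):

review = client.beta.chat.completions.parse(
    model="gpt-4.1",
    temperature=0.8,                                  # the slightly lower temperature mentioned above
    messages=[
        {"role": "system", "content": "..."},         # review instructions
        {"role": "assistant", "content": draft.body}, # text from the parsed first draft, fed back unchanged
        {"role": "user", "content": "..."},           # new user message asking for the cross-check
    ],
    response_format=DocumentDraft,
)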

For debugging / monitoring I'm logging the information I'm about to send back, and it looks OK - though I'm aware that it's often difficult to identify encoding issues reliably just by looking at logging output. But Python uses Unicode strings internally, my console is UTF-8, and the json.dumps output looks fine.
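
(Side note on checking this in logs, as a sketch with a placeholder variable name: json.dumps escapes non-ASCII characters by default, so logging both the escaped and the unescaped form makes odd escape sequences easier to spot.)

import json, logging

payload = {"body": draft_text}   # draft_text: placeholder for the text being fed back
logging.debug("escaped:   %s", json.dumps(payload))                      # default ensure_ascii=True -> \uXXXX escapes
logging.debug("unescaped: %s", json.dumps(payload, ensure_ascii=False))  # raw UTF-8 characters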

Out of frustration I also upgraded the OpenAI Python library to the current version, but nothing changed.

Did anyone else notice changed behaviour in regard to character encodings?

Unfortunately, I cannot provide real examples here.

5 Likes

Thank you for bringing this up.

I’m running into the exact same issue with gpt-4.1 while gpt-4o works perfectly.

gpt-4.1 will generate structured output like this:

{"evaluation": "Die Antwort ist sehr kurz. Um sicherzugehen, dass alle Aspekte von Frage 1 abgedeckt sind, sollte der Interviewer noch nachfragen, ob der Teilnehmer Online-Banking nutzt oder andere digitale Tools zur \u000cberwachung von Ausgaben oder zum Sparen verwendet. Ziel ist es, ein vollst\u0000e4ndiges Bild vom aktuellen Setup zu bekommen.", "action":  …

Note the \u000c in \u000cberwachung, which should actually be \u00dc (“Ü”), and \u0000e4, which should be \u00e4 (“ä”).
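
To make concrete what a JSON parser does with those sequences, a quick illustration (not from the actual API output):

import json

print(repr(json.loads('"vollst\\u00e4ndiges"')))    # 'vollständiges'        (correct escape)
print(repr(json.loads('"vollst\\u0000e4ndiges"')))  # 'vollst\x00e4ndiges'   (NUL byte + literal "e4")
print(repr(json.loads('"\\u000cberwachung"')))      # '\x0cberwachung'       (form feed instead of "Ü")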

I noticed this issue with German umlauts (example above) but also special characters in other languages such as Spanish.

I have not found a workaround for this problem yet.

1 Like

@pen: At least it's good to hear I'm not the only one stumbling across this - I was already having serious doubts, as I could not find any other reports about it on the Web.

I also performed a bunch of further tests today - I too observe \u escape sequences with at least one superfluous zero digit, which messes up the codepoint. (On top of that, the encoded numeric value itself may be incorrect - I haven't checked for that so far, but I think it's quite possible.)

I also frequently see \x1f instead of the proper character. And it doesn't only happen in the review step, but also in the initial generation step - the review step just seems to increase the likelihood that the model goes off the rails and produces garbage character encoding.

My structured output has three fields, and the incorrect encodings don't always occur in all of them immediately - sometimes only one or two of the fields are incorrectly encoded after the generation step, while the remaining field is fine. However, when I inject this information back for the review step, it seems to “contaminate” the processing, and everything is messed up after the review step. :frowning:

Also, there seem to be specific inputs which trigger this behaviour with a very high likelihood, while everything works well in many other cases. I couldn't really identify anything special about these inputs though - I noticed that a bunch of my source data accidentally had zero-width space characters embedded in certain places, but removing these didn't improve things - they don't seem to have caused any harm as far as I can currently tell, and were not the culprit.

I tried normalizing all strings to Unicode NFC normal form before submitting them to the Chat API, but this also didn't help. (Maybe it does something like that internally already as part of the tokenization step - that would at least make sense in order not to artificially increase the number of ways the same text can be tokenized.)
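
(For reference, NFC normalization is just a standard-library one-liner - a minimal sketch:)

import unicodedata

def to_nfc(s: str) -> str:
    # Compose combining sequences, e.g. "a" + U+0308 into the single codepoint "ä"
    return unicodedata.normalize("NFC", s)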

2 Likes

The mod team has passed this on to OpenAI for further investigation.

Thanks for reporting!

2 Likes

Yeah, me too! I discovered this issue Friday afternoon and was surprised to see no posts about it anywhere, given how frequently I ran into it. I created an account here today to report it myself and discovered your post instead :smiley:

Much appreciated!

1 Like

Exactly the same for me, I also created my account just today. :joy:

Since the issue seems to arise from incorrect JSON Unicode escape sequences, and JSON is usually UTF-8 anyway (and may contain non-ASCII characters directly in any case), I just tried the following as a workaround in the part of the prompt which requests JSON output for the structured output feature (only the second instruction is new):

Gib das Ergebnis im JSON-Format zurück. (“Return the result in JSON format.”)

Verwende direkt UTF-8-Unicode-Zeichen, keine Escape-Codes. (“Use UTF-8 Unicode characters directly, no escape codes.”)

At first glance this actually seems to have helped! I haven't tested it thoroughly yet, but three “difficult” documents (i.e. documents based on apparently “difficult” input / source data) were generated just fine with this on the first attempt.

Maybe worth a shot.
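
(In context, the extra line simply goes at the end of the system / instruction message of the structured-output request - sketched here with placeholder content:)

messages = [
    {
        "role": "system",
        "content": "...\n"                                              # existing drafting instructions
                   "Gib das Ergebnis im JSON-Format zurück.\n"
                   "Verwende direkt UTF-8-Unicode-Zeichen, keine Escape-Codes.",
    },
    {"role": "user", "content": "..."},
]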

Well, sorry this is the reason you found us, but I recommend searching the forums and sticking around. We’ve got a wealth of information and some great people here!

Thanks for sharing. Just the type of person we want around here! :wink:

Good idea, and thanks for sharing. I added this to my prompt, and it does indeed seem to reduce the likelihood of invalid escape sequences:

Use UTF-8 characters for string values in the JSON response and never use escape characters. For example, use "ä" instead of "\\u00e4" or "\\xe4".

I don’t trust it enough to use it in production though :smiley:

2 Likes

4o is pretty good. Sticking with it till they iron out the bugs. Been doing that for any platform for the last 20 years. Old school as it is, it's still a pretty solid production lifecycle. Thanks for all the feedback, guys. Exciting times. Testing out these new models has certainly been entertaining :smiley:

It’s been a while so I thought I’d give this another go to see if this was maybe silently fixed. Unfortunately, the problem still persists.

@pen: I think the models behind model names with an exact date never change.

Also, I doubt that it’s that easy to fix such a strange problem - looks as if the model somehow learned to complete with incorrect tokens, and to my knowledge you cannot easily “unlearn” incorrect predictions in transformer neural networks.

However, the “trick” or workaround of requesting that the model use Unicode characters directly instead of escape sequences works quite well, according to my observations since then. It's not 100% foolproof, but the usual “unexpected input completion” issues (regarding the actual text content) outweigh the incorrect escape sequence issue by far.

With the instruction to avoid escape sequences altogether, the model only very rarely gets it wrong. The undesired characters that do slip through can be filtered out, and the worst result is that, very rarely, the odd umlaut character is missing - like small, rare typos, which the models sometimes make anyway.
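
A minimal sketch of such a filter (the exact set of characters to strip is an assumption, based on what has shown up in this thread - NUL, form feed, \x1f and the like):

import re

# Strip ASCII control characters (except tab, newline, carriage return)
# left behind by the broken \u escapes, e.g. \x00, \x0c, \x1f.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")

def strip_control_chars(s: str) -> str:
    return _CONTROL_CHARS.sub("", s)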

This whole LLM business is no exact science by far…

Yeah, I am not sure if OpenAI does any fine-tuning after releasing model versions to the API. Speaking of, the problem might be fixable using the fine-tuning API (https://platform.openai.com/docs/guides/fine-tuning)? I have not experimented with that yet.

Given how frequently I see bad encoding, I am honestly surprised how few reports of the issue there are. I guess the combination of structured output with non-English language is just really not that common?

I also wondered about that. It looks like it - or maybe German umlauts specifically trigger the issue, so the group of affected users is smaller?

Still, given the number of users the OpenAI API surely has, I'd also have expected much more noise / complaints about this issue…

Or maybe we’re doing something specific to trigger it, without being aware of this…

No, I have seen it triggered with other languages as well. At the very least Spanish and Swedish are also affected.

Swedish uses some of the same umlauts as German (ä, ö), but it also breaks with other characters, e.g.

 "Deltagaren har svarat p\u0000e5 fr\u0000e5ga 3."
  • \u0000e5 → intended to be å → correct escape is \u00e5

For Spanish:

"El participante ha explicado que la forma de la botella y el tipo de destilado, especialmente si es m\u0000e1s a\u0000f1ejo, …"
  • \u0000e1 → intended to be á → correct escape is \u00e1
  • \u0000f1 → intended to be ñ → correct escape is \u00f1

We’ve been working extensively with multi-language LLM outputs and have encountered similar encoding issues across Finnish, Swedish, German, and other languages with non-ASCII characters.

The patterns we’ve identified align exactly with what you’re seeing:

  • Finnish/Nordic: ‘e4’ → ‘ä’, ‘f6’ → ‘ö’, ‘e5’ → ‘å’
  • Swedish: \u0000e5 → intended to be å
  • German umlauts: \u0000e4 → intended to be ä
  • Euro symbol corruption: e282ac → €

After some experimentation with regex-based approaches, we found a surprisingly effective solution: using a lightweight LLM to correct these encoding issues.

We created a function that:

  1. Detects problematic patterns (null bytes, specific character sequences)
  2. Sends corrupted text to an LLM (e.g. gpt-4.1-nano)

The solution works remarkably well across languages and formats (including nested JSON), with ~95% correction rate in our tests.

Here’s a simplified version of our approach:

from langchain_openai import ChatOpenAI  # pip install langchain-openai


def fix_with_llm(text, model_name="gpt-4.1-mini"):
    prompt = f"""Fix the corrupted text below that has character encoding issues.
Focus on these specific problems:
1. Finnish/Nordic characters: 'e4' → 'ä', 'f6' → 'ö', 'e5' → 'å'
2. Null bytes (\\u0000 escapes or literal NUL characters) that should be removed
3. Euro symbol corruption (e282ac → €)
4. Range notation like '510' → '5–10' with proper en-dash
5. Invalid escape sequences like \\x, \\e, or incomplete Unicode escapes

PRESERVE all formatting including code blocks, links, emphasis.

Text to fix:
{text}"""

    # Tool definition: forcing this single tool call returns the fixed text
    # as a structured argument, without any surrounding commentary.
    fix_tool = {
        "type": "function",
        "function": {
            "name": "return_fixed_text",
            "description": "Return the fixed text without commentary",
            "parameters": {
                "type": "object",
                "properties": {
                    "fixed_text": {
                        "type": "string",
                        "description": "The corrected text"
                    }
                },
                "required": ["fixed_text"]
            }
        }
    }

    # Use function calling to get clean output (LangChain)
    llm = ChatOpenAI(temperature=0, model=model_name).bind_tools(
        [fix_tool], tool_choice="return_fixed_text"
    )

    response = llm.invoke(prompt)
    return response.tool_calls[0]["args"]["fixed_text"]
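
Usage is then just a matter of running suspect field values through the function, gated behind the pattern detection from step 1 above so that clean text skips the extra round trip - roughly:

# looks_corrupted() stands in for the detection step (1.) above - hypothetical name.
if looks_corrupted(field_value):
    field_value = fix_with_llm(field_value)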

We’ve found the LLM approach particularly valuable because:

  1. It’s language-agnostic and handles multiple corruption patterns simultaneously
  2. It preserves formatting/structure (markdown, JSON, etc.)
  3. It’s more robust than regex for complex cases
  4. It catches edge cases that would require many custom rules

It might very well seem like overkill to use an LLM for string correction, but the simplicity and robustness made it worthwhile, and it has been surprisingly reliable in production.

Hope this is helpful!

2 Likes

@mendel: I also noticed different kinds of corruption, like just a \u0000 sequence being emitted in place of the intended character.

For me, the approach of “just” telling the model not to output escape sequences at all, but the characters directly, works quite reliably, and it does not require extra “rounds” of processing.

Thanks @Gunter! We actually did incorporate that helpful tip, and it was effective in reducing the volume of corruption, but perhaps due to the nuances of our workflow it didn't eliminate it entirely.

1 Like