GPT 4.1 Character Encoding Issues?

Hi all,

I assume the problem is on my end somehow, but I currently don’t understand why / where:

For generating business document drafts, I've been using the OpenAI Chat Completions API with structured output (via the Python API library and Pydantic) successfully for months with two different versions of the GPT 4o model.
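
For reference, the call is roughly the standard structured-output pattern - a minimal sketch with placeholder model / schema / prompt contents (not my real ones):

from openai import OpenAI
from pydantic import BaseModel

class DocumentDraft(BaseModel):   # placeholder schema, not my real fields
    title: str
    body: str

client = OpenAI()

completion = client.beta.chat.completions.parse(
    model="gpt-4.1",              # previously a gpt-4o snapshot
    messages=[
        {"role": "system", "content": "..."},   # drafting instructions
        {"role": "user", "content": "..."},     # source data
    ],
    response_format=DocumentDraft,
)
draft = completion.choices[0].message.parsed    # a DocumentDraft instance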

I experimentally switched to GPT 4.1 (the full model) after it was announced. The actual content of the generated documents seems noticeably better than before, possibly due to the better instruction-following capabilities and a better understanding of what NOT to write.

However, something I never experienced with GPT 4o: GPT 4.1 frequently messes up the output encoding, returning garbled characters instead of properly encoded UTF-8 whenever non-ASCII characters occur in the text. It does not happen every time (i.e. not for all generated documents), but much too often, and I wasn't able to solve it with specific prompting (telling it to pay attention to encoding).

It mainly happens in a step where I feed a generated first document draft back for an additional review round at a slightly lower temperature (0.8), to make the model cross-check that it properly incorporated the instructions into the generated document. While the version I got from the previous step still seems to have proper special characters, the reviewed version I get back after this step has garbled ones.

So I suspect that I'm possibly messing up the information I provide for review, but I'm really just taking the text from the parsed JSON and feeding it back together with a new user message. It's also exactly what I did with GPT 4o all the time, and it never caused any issues until I switched to 4.1…
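
Continuing the sketch above, the review round looks roughly like this (again placeholder contents; the exact message layout is illustrative):

review = client.beta.chat.completions.parse(
    model="gpt-4.1",
    temperature=0.8,                                  # the slightly lower temperature mentioned above
    messages=[
        {"role": "system", "content": "..."},         # review instructions
        {"role": "assistant", "content": draft.body}, # text from the parsed first draft, fed back unchanged
        {"role": "user", "content": "..."},           # new user message asking for the cross-check
    ],
    response_format=DocumentDraft,
)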

For debugging / monitoring I'm logging the information I'm about to send back, and it looks OK - though I'm aware that it's often difficult to identify encoding issues reliably just by looking at logging output. But Python uses Unicode strings internally, my console is UTF-8, and the json.dumps output looks fine.
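
(Side note on checking this in logs, as a sketch with a placeholder variable name: json.dumps escapes non-ASCII characters by default, so logging both the escaped and the unescaped form makes odd escape sequences easier to spot.)

import json, logging

payload = {"body": draft_text}   # draft_text: placeholder for the text being fed back
logging.debug("escaped:   %s", json.dumps(payload))                      # default ensure_ascii=True -> \uXXXX escapes
logging.debug("unescaped: %s", json.dumps(payload, ensure_ascii=False))  # raw UTF-8 characters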

Out of frustration I also upgraded the OpenAI Python library to the current version, but nothing changed.

Did anyone else notice changed behaviour in regard to character encodings?

Unfortunately, I cannot provide real examples here.

5 Likes

Thank you for bringing this up.

I’m running into the exact same issue with gpt-4.1 while gpt-4o works perfectly.

gpt-4.1 will generate structured output like this:

{"evaluation": "Die Antwort ist sehr kurz. Um sicherzugehen, dass alle Aspekte von Frage 1 abgedeckt sind, sollte der Interviewer noch nachfragen, ob der Teilnehmer Online-Banking nutzt oder andere digitale Tools zur \u000cberwachung von Ausgaben oder zum Sparen verwendet. Ziel ist es, ein vollst\u0000e4ndiges Bild vom aktuellen Setup zu bekommen.", "action":  …

Note the \u000c in \u000cberwachung, which should actually be \u00dc (“Ü”), and \u0000e4, which should be \u00e4 (“ä”).
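
To make concrete what a JSON parser does with those sequences, a quick illustration (not from the actual API output):

import json

print(repr(json.loads('"vollst\\u00e4ndiges"')))    # 'vollständiges'        (correct escape)
print(repr(json.loads('"vollst\\u0000e4ndiges"')))  # 'vollst\x00e4ndiges'   (NUL byte + literal "e4")
print(repr(json.loads('"\\u000cberwachung"')))      # '\x0cberwachung'       (form feed instead of "Ü")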

I noticed this issue with German umlauts (example above) but also special characters in other languages such as Spanish.

I have not found a workaround for this problem yet.

1 Like

@pen: At least it's good to hear I'm not the only one stumbling across this - I was already having serious doubts, as I could not find any other reports about it on the Web.

I also performed a bunch of further tests today - I too observe \u escape sequences with at least one superfluous zero digit, which messes up the codepoint. (On top of that, the encoded numeric value itself may be incorrect - I haven't checked for that so far, but I think it's quite possible.)

I also frequently see \x1f instead of the proper character. And it doesn't only happen in the review step, but also in the initial generation step - the review step just seems to increase the likelihood that the model goes off the rails and produces garbage character encoding.

My structured output has three fields, and the incorrect encodings don't always occur in all of them immediately - sometimes only one or two of the fields are incorrectly encoded after the generation step, while the remaining field is fine. However, when I inject this information back for the review step, it seems to “contaminate” the processing, and everything is messed up after the review step. :frowning:

Also, there seem to be specific inputs which trigger this behaviour with a very high likelihood, while everything works well in many other cases. I couldn't really identify anything special about these inputs though - I noticed that a bunch of my source data accidentally had zero-width space characters embedded in certain places, but removing these didn't improve things - they don't seem to have caused any harm as far as I can currently tell, and were not the culprit.

I tried normalizing all strings to Unicode NFC normal form before submitting them to the Chat API, but this also didn't help. (Maybe it does something like that internally already as part of the tokenization step - that would at least make sense in order not to artificially increase the number of ways the same text can be tokenized.)
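
(For reference, NFC normalization is just a standard-library one-liner - a minimal sketch:)

import unicodedata

def to_nfc(s: str) -> str:
    # Compose combining sequences, e.g. "a" + U+0308 into the single codepoint "ä"
    return unicodedata.normalize("NFC", s)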

2 Likes

The mod team has passed this on to OpenAI for further investigation.

Thanks for reporting!

2 Likes

Yeah, me too! I discovered this issue Friday afternoon and was surprised to see no posts about it anywhere, given how frequently I ran into it. I created an account here today to report it myself and discovered your post instead :smiley:

Much appreciated!

1 Like

Exactly the same for me, I also created my account just today. :joy:

Since the issue seems to arise from incorrect JSON Unicode escape sequences, and JSON is usually UTF-8 anyway (and may contain non-ASCII characters directly in any case), I just tried the following as a workaround in the part of the prompt which requests JSON output for the structured output feature (only the second instruction is new):

Gib das Ergebnis im JSON-Format zurück. (“Return the result in JSON format.”)

Verwende direkt UTF-8-Unicode-Zeichen, keine Escape-Codes. (“Use UTF-8 Unicode characters directly, no escape codes.”)

At first glance this actually seems to have helped! I haven't tested it thoroughly yet, but three “difficult” documents (i.e. documents based on apparently “difficult” input / source data) were generated just fine with this on the first attempt.

Maybe worth a shot.
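
(In context, the extra line simply goes at the end of the system / instruction message of the structured-output request - sketched here with placeholder content:)

messages = [
    {
        "role": "system",
        "content": "...\n"                                              # existing drafting instructions
                   "Gib das Ergebnis im JSON-Format zurück.\n"
                   "Verwende direkt UTF-8-Unicode-Zeichen, keine Escape-Codes.",
    },
    {"role": "user", "content": "..."},
]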

Well, sorry this is the reason you found us, but I recommend searching the forums and sticking around. We’ve got a wealth of information and some great people here!

Thanks for sharing. Just the type of person we want around here! :wink:

Good idea, and thanks for sharing. I added this to my prompt, and it does indeed seem to reduce the likelihood of invalid escape sequences:

Use UTF-8 characters for string values in the JSON response and never use escape characters. For example, use "ä" instead of "\\u00e4" or "\\xe4".

I don’t trust it enough to use it in production though :smiley:

2 Likes

4o is pretty good. Sticking with it till they iron out the bugs. Been doing that for any platform for the last 20 years. Old school as it is, it's still a pretty solid production lifecycle. Thanks for all the feedback, guys. Exciting times. Testing out these new models has certainly been entertaining :smiley:

It’s been a while so I thought I’d give this another go to see if this was maybe silently fixed. Unfortunately, the problem still persists.

@pen: I think the models behind model names with an exact date never change.

Also, I doubt that it’s that easy to fix such a strange problem - looks as if the model somehow learned to complete with incorrect tokens, and to my knowledge you cannot easily “unlearn” incorrect predictions in transformer neural networks.

However, the “trick” or workaround of requesting that the model use Unicode characters directly instead of escape sequences works quite well, according to my observations since then. It's not 100% foolproof, but the usual “unexpected input completion” issues (regarding the actual text content) outweigh the incorrect escape sequence issue by far.

With the instruction to avoid escape sequences altogether, the model only very rarely gets it wrong. The undesired characters that do slip through can be filtered out, and the worst result is that, very rarely, the odd umlaut character is missing - like small, rare typos, which the models sometimes make anyway.
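
A minimal sketch of such a filter (the exact set of characters to strip is an assumption, based on what has shown up in this thread - NUL, form feed, \x1f and the like):

import re

# Strip ASCII control characters (except tab, newline, carriage return)
# left behind by the broken \u escapes, e.g. \x00, \x0c, \x1f.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")

def strip_control_chars(s: str) -> str:
    return _CONTROL_CHARS.sub("", s)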

This whole LLM business is no exact science by far…

Yeah, I am not sure if OpenAI does any fine-tuning after releasing model versions to the API. Speaking of, the problem might be fixable using the fine-tuning API (https://platform.openai.com/docs/guides/fine-tuning)? I have not experimented with that yet.

Given how frequently I see bad encoding, I am honestly surprised how few reports of the issue there are. I guess the combination of structured output with non-English language is just really not that common?

I also wondered about that. It looks like it - or maybe German umlauts specifically trigger the issue, so the group of affected users is smaller?

Still, given the number of users the OpenAI API surely has, I'd also have expected much more noise / complaints about this issue…

Or maybe we’re doing something specific to trigger it, without being aware of this…

No, I have seen it triggered with other languages as well. At the very least Spanish and Swedish are also affected.

Swedish uses some of the same umlauts as German (ä, ö), but it also breaks with other characters, e.g.

 "Deltagaren har svarat p\u0000e5 fr\u0000e5ga 3."
  • \u0000e5 → intended to be å → correct escape is \u00e5

For Spanish:

"El participante ha explicado que la forma de la botella y el tipo de destilado, especialmente si es m\u0000e1s a\u0000f1ejo, …"
  • \u0000e1 → intended to be á → correct escape is \u00e1
  • \u0000f1 → intended to be ñ → correct escape is \u00f1

We’ve been working extensively with multi-language LLM outputs and have encountered similar encoding issues across Finnish, Swedish, German, and other languages with non-ASCII characters.

The patterns we’ve identified align exactly with what you’re seeing:

  • Finnish/Nordic: ‘e4’ → ‘ä’, ‘f6’ → ‘ö’, ‘e5’ → ‘å’
  • Swedish: \u0000e5 → intended to be å
  • German umlauts: \u0000e4 → intended to be ä
  • Euro symbol corruption: e282ac → €

After some experimentation with regex-based approaches, we found a surprisingly effective solution: using a lightweight LLM to correct these encoding issues.

We created a function that:

  1. Detects problematic patterns (null bytes, specific character sequences)
  2. Sends corrupted text to an LLM (e.g. gpt-4.1-nano)

The solution works remarkably well across languages and formats (including nested JSON), with ~95% correction rate in our tests.

Here’s a simplified version of our approach:

from langchain_openai import ChatOpenAI  # pip install langchain-openai


def fix_with_llm(text, model_name="gpt-4.1-mini"):
    prompt = f"""Fix the corrupted text below that has character encoding issues.
Focus on these specific problems:
1. Finnish/Nordic characters: 'e4' → 'ä', 'f6' → 'ö', 'e5' → 'å'
2. Null bytes (\\u0000 escapes or literal NUL characters) that should be removed
3. Euro symbol corruption (e282ac → €)
4. Range notation like '510' → '5–10' with proper en-dash
5. Invalid escape sequences like \\x, \\e, or incomplete Unicode escapes

PRESERVE all formatting including code blocks, links, emphasis.

Text to fix:
{text}"""

    # Tool definition: forcing this single tool call returns the fixed text
    # as a structured argument, without any surrounding commentary.
    fix_tool = {
        "type": "function",
        "function": {
            "name": "return_fixed_text",
            "description": "Return the fixed text without commentary",
            "parameters": {
                "type": "object",
                "properties": {
                    "fixed_text": {
                        "type": "string",
                        "description": "The corrected text"
                    }
                },
                "required": ["fixed_text"]
            }
        }
    }

    # Use function calling to get clean output (LangChain)
    llm = ChatOpenAI(temperature=0, model=model_name).bind_tools(
        [fix_tool], tool_choice="return_fixed_text"
    )

    response = llm.invoke(prompt)
    return response.tool_calls[0]["args"]["fixed_text"]
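
Usage is then just a matter of running suspect field values through the function, gated behind the pattern detection from step 1 above so that clean text skips the extra round trip - roughly:

# looks_corrupted() stands in for the detection step (1.) above - hypothetical name.
if looks_corrupted(field_value):
    field_value = fix_with_llm(field_value)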

We’ve found the LLM approach particularly valuable because:

  1. It’s language-agnostic and handles multiple corruption patterns simultaneously
  2. It preserves formatting/structure (markdown, JSON, etc.)
  3. It’s more robust than regex for complex cases
  4. It catches edge cases that would require many custom rules

It might very well seem like overkill to use an LLM for string correction, but the simplicity and robustness made it worthwhile, and it has been surprisingly reliable in production.

Hope this is helpful!

2 Likes

@mendel: I also noticed different kinds of corruption, like just a \u0000 sequence being emitted in place of the intended character.

For me, the approach of “just” telling the model not to output escape sequences at all, but the characters directly, works quite reliably, and it does not require extra “rounds” of processing.

Thanks @Gunter! We actually did incorporate that helpful tip, and it was effective in reducing the volume of corruption, but perhaps due to the nuances of our workflow it didn't eliminate it entirely.

1 Like