Codex VSCode - Major issue - LLM suddenly can't handle encoding

Greetings,

For about a day now, I’ve been seeing severe bugs when using Codex High Local in VSCode. The model appears to have lost its understanding of encoding rules. Nearly every second task now results in corrupted code, across several different projects. After some investigation, I was able to identify what’s happening.

The model is now using methods for reading and writing files that corrupt the encoding and destroy code integrity. This issue occurs not only with PowerShell but also with Python, often leading to completely broken files.

Below is a technical summary of the problem:

Bug Summary: Encoding Corruption When Using PowerShell for File Edits

When modifying project files using PowerShell commands such as Get-Content, Set-Content, -replace, or manual byte conversions, files may become encoding-corrupted.

Technical Cause

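In Windows PowerShell 5.1, Get-Content and Set-Content default to the system ANSI code page (CP1252 on Western European systems), not UTF-8. PowerShell 7 and later default to UTF-8 without a BOM, so the problem mainly affects the older shell that ships with Windows.
When a script reads or writes UTF-8 files (for example, PHP files containing <?php, special characters, or umlauts) using these commands without an explicit -Encoding argument, PowerShell silently interprets or writes the text as CP1252. Every multi-byte UTF-8 character is then treated as several single-byte characters, which results in invalid byte sequences and corrupted characters.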
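The mechanism can be reproduced without PowerShell. The following is a minimal sketch in Python, assuming the faulty step is "read UTF-8 bytes as CP1252, then save the result as UTF-8"; the variable names are illustrative only.

```python
# Text as the author intended it (the characters named in this post).
original = "äöüß"

# The bytes a UTF-8 file on disk would contain: two bytes per character here.
utf8_bytes = original.encode("utf-8")

# Faulty step 1: a tool reads those bytes assuming CP1252.
# Each two-byte UTF-8 sequence turns into two separate characters.
misread = utf8_bytes.decode("cp1252")

# Faulty step 2: the tool saves the misread text as UTF-8 again.
corrupted_bytes = misread.encode("utf-8")

print(repr(original), "->", repr(misread))
print(len(utf8_bytes), "bytes before,", len(corrupted_bytes), "bytes after")
```

The file grows and every non-ASCII character changes, which is consistent with the large diffs described under the symptoms.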

Example Symptoms

  • UTF-8 files with <?php lose the BOM or break PHP syntax.

  • Umlauts and special characters (ä, ö, ü, ß, etc.) appear as garbled text.

  • Git diffs show massive binary changes even for minor edits.
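These symptoms can be checked for mechanically. Below is a rough sketch of a scanner, not a definitive tool: `find_encoding_damage` and `scan_tree` are hypothetical helper names, the suffix list is an assumption, and the double-encoding test is a heuristic that can flag a line which legitimately contains text such as "Ã¤".

```python
from pathlib import Path


def find_encoding_damage(data: bytes) -> list:
    """Return (line_number, reason) pairs; line 0 means the whole file."""
    try:
        text = data.decode("utf-8-sig")  # also accepts a leading UTF-8 BOM
    except UnicodeDecodeError as exc:
        return [(0, f"not valid UTF-8 at byte {exc.start}")]

    findings = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        if line.isascii():
            continue  # pure ASCII cannot be double-encoded
        try:
            # If the line survives CP1252 -> UTF-8, it was probably
            # UTF-8 that had been misread as CP1252 and saved again.
            recovered = line.encode("cp1252").decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue  # normal non-ASCII text ends up here
        findings.append((line_no, f"looks double-encoded, maybe {recovered!r}"))
    return findings


def scan_tree(root: str, suffixes=(".php", ".py", ".js")) -> dict:
    """Scan files under root; the suffix list is an assumption, adjust it."""
    report = {}
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            findings = find_encoding_damage(path.read_bytes())
            if findings:
                report[str(path)] = findings
    return report
```

Calling `scan_tree(".")` from the project root returns a dictionary of suspicious files. Review the findings by hand before changing anything.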

This sometimes happens even when I explicitly instruct the model (within the agent system) not to use those unsafe methods. It uses them anyway, which suggests something deeper is wrong, possibly in the system prompts or instruction handling.

Escalation of the Issue

Furthermore, the situation becomes even worse when you instruct the model to fix the corrupted files. It does not seem to understand what actually happened and instead continues generating additional scripts that attempt to “convert” parts of the code again. This leads to even more severe corruption, eventually resulting in completely broken files that contain invalid or binary-like content.
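To illustrate why repeated "conversion" makes things worse, here is a small sketch, assuming each bad pass is again "UTF-8 read as CP1252". The helper names `misread_once`, `undo_once`, and `repair` are hypothetical. The repair only reverses passes that decode cleanly and stops otherwise, so it does not guess.

```python
from typing import Optional


def misread_once(text: str) -> str:
    """Simulate one bad pass: UTF-8 bytes interpreted as CP1252."""
    return text.encode("utf-8").decode("cp1252")


def undo_once(text: str) -> Optional[str]:
    """Reverse one bad pass, or return None if the text does not fit."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return None


def repair(text: str, max_passes: int = 5) -> str:
    """Undo bad passes until the text no longer looks double-encoded."""
    for _ in range(max_passes):
        candidate = undo_once(text)
        if candidate is None or candidate == text:
            break  # clean text or unknown damage: leave it alone
        text = candidate
    return text


once = misread_once("ä")
twice = misread_once(once)
print(len("ä"), len(once), len(twice))  # the text doubles with each pass
print(repair(twice) == "ä")
```

This only works when the whole string went through the same number of passes. A file that mixes damaged and undamaged lines needs line-by-line handling, and anything that was lossy (for example, characters replaced by "?") cannot be recovered this way and has to come from version control.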

I tested this behavior across multiple projects in different programming languages, and the issue appears consistently everywhere.

Severity

This is a critical issue, as it can occur silently, file by file, without any visible warnings. The model gradually destroys codebases until numerous files are corrupted and the project starts producing errors.
If you then ask the model to fix the problem without explicitly explaining the root cause (encoding corruption), it continues to rewrite files incorrectly and can do irreversible damage to large portions of the codebase.


@All

Please search your codebase for this error (repeatedly, since it can recur) to find out whether it has already been corrupted, and add the following to your agent instructions. This will not fully prevent the problem:

**Encoding Safety:** Do **not** use PowerShell `Get-Content`/`Set-Content`, `-replace`, or ad-hoc byte conversion for editing project files. They default to CP1252 and corrupt UTF-8 files (e.g., `<?php`, umlauts).

Use `apply_patch` or Python (`python - <<'PY' … PY`) with explicit UTF-8 read/write for all file edits.
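As an illustration of what "explicit UTF-8 read/write" can look like, here is a minimal sketch. `replace_in_file` is a hypothetical helper, not part of Codex or any library. It works on bytes so that an existing BOM and the file's line endings are kept as they are, and it fails loudly instead of guessing when the file is not valid UTF-8.

```python
import codecs
from pathlib import Path


def replace_in_file(path: str, old: str, new: str) -> int:
    """Replace old with new in a UTF-8 file; return the number of matches."""
    raw = Path(path).read_bytes()

    # Remember whether the file started with a UTF-8 BOM.
    has_bom = raw.startswith(codecs.BOM_UTF8)
    body = raw[len(codecs.BOM_UTF8):] if has_bom else raw

    # Strict decode: raises UnicodeDecodeError rather than corrupting data.
    text = body.decode("utf-8")

    count = text.count(old)
    if count:
        updated = text.replace(old, new).encode("utf-8")
        prefix = codecs.BOM_UTF8 if has_bom else b""
        # Writing bytes avoids any newline or code page translation.
        Path(path).write_bytes(prefix + updated)
    return count
```

For example, `replace_in_file("index.php", "echo", "print")` (the file name is made up) leaves umlauts and a leading BOM untouched and only writes the file if something matched.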