Hello OpenAI team,

I’m using the OpenAI API to extract and structure addresses from free-text descriptions, and I rely on the `response_format: json` option to ensure clean, machine-readable output. However, in many cases the API returns malformed or incorrectly encoded characters within the JSON response. For example, instead of returning “São Paulo” or “Guarujá”, I receive:

```
{ "estado": "S\u00050", "cidade": "Guaruj\u00101" }
```

These are control characters (\u0005, \u0010, etc.) that corrupt the expected UTF-8 output, making it unusable in production systems. This behavior has been consistent and is severely affecting the reliability of our integration.

For context, in this past month alone our usage statistics are:

Total tokens: 1,971,385,774
Total requests: 344,854

We kindly ask for guidance or a fix, as we rely strictly on the model’s output for critical address processing and need consistent, UTF-8-clean responses. Thank you in advance for your support.

Best regards,
Luis
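A quick way to confirm what is actually inside a payload like this is to scan the decoded JSON values for C0 control characters. This is a hypothetical helper sketched for illustration, not an official fix:

```python
import json
import re

# Matches C0 control characters (U+0000–U+001F), which should never
# appear in human-readable address fields.
CONTROL_RE = re.compile(r"[\x00-\x1f]")

def find_control_chars(raw_json: str) -> dict:
    """Return {field: [codepoints]} for every value containing control chars."""
    data = json.loads(raw_json)
    report = {}
    for key, value in data.items():
        if isinstance(value, str):
            hits = [f"U+{ord(c):04X}" for c in CONTROL_RE.findall(value)]
            if hits:
                report[key] = hits
    return report

# The corrupted sample from the post: \u0005 and \u0010 inside the values.
sample = '{ "estado": "S\\u00050", "cidade": "Guaruj\\u00101" }'
print(find_control_chars(sample))  # {'estado': ['U+0005'], 'cidade': ['U+0010']}
```

Running this against each response before storing it at least tells you which fields are corrupted and which code points are involved.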
Can you share more details on what input, model, and prompt you are using?
This is quite disturbing. Sometimes it returns the same character for every letter that is not present in the English alphabet, which makes it impossible to find a proper way to repair the corrupted characters. I have been expecting a solution for this for a long time.
You do not define “it” here. However, if you are using language models that show this symptom, you can try `json_object` as the text response format type, along with a system prompt that describes the schema to the AI model, and use these API model names:
gpt-4-turbo
gpt-4
gpt-4o-2024-05-13
gpt-5
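A minimal sketch of that setup with the Chat Completions API; the schema fields and the sample address here are illustrative assumptions, not from the thread:

```python
# Sketch only: the system prompt describes the schema, and json_object
# output is requested. Field names and sample input are assumptions.
payload = {
    "model": "gpt-4-turbo",
    "response_format": {"type": "json_object"},
    "messages": [
        {
            "role": "system",
            "content": (
                "Extract the address from the user text and answer with JSON "
                'only, matching: {"estado": string, "cidade": string}. '
                "Use proper UTF-8 accented characters, never escape sequences."
            ),
        },
        {"role": "user", "content": "Av. Paulista, São Paulo - SP"},
    ],
}

# With the official Python SDK this would be sent as, e.g.:
# from openai import OpenAI
# client = OpenAI()
# completion = client.chat.completions.create(**payload)
# print(completion.choices[0].message.content)
print(payload["response_format"])  # {'type': 'json_object'}
```

The schema description in the system prompt is what `json_object` mode relies on, since unlike `json_schema` mode it does not enforce a structure by itself.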
These are a class of models not affected in the same way by this pervasive, unacknowledged, and apparently unfixable issue, which has dozens of continuing reports against the API, to the point where the current o4 and 4.1 models have to be abandoned when working with accented world languages whose text may originally have used alternate ASCII code pages.
I don’t know if this has been solved, but I’m running into it when having gpt-5.1 visually OCR documents. This document reliably hits it every time:
[Note I need to figure out how to post a link, it’s blocking…]
the server is media-cdn dot rollcall dot com. It’s in a folder called pdfs and the file is called example-page-5.pdf
When I try to have it do the OCR structured, it dies at the same place (note I removed some things to slim it down, and the paste is curly-quoting things, but you can see where it breaks):
```
JSON Decode Error: Unterminated string starting at: line 1 column 2831 (char 2830)
{
  "filename": "example-page-5.pdf",
  "category": "Court Document",
  "subcategory": "Exhibit",
  "date": "2021-05-20",
  "summary": "[Removed]",
  "email": false,
  "sections": [
    { "section_number": 1, "title": "AK’s recollection of investigations…", },
    { "section_number": 2, "title": "AK’s…" }
  ],
  "ocr": [
    { "sequence": 1, "page": 1, "section": 1,
      "text": "Sean to call AK back and let her know if the Florida agents had concerns. Seannever called AK back. AK doesn’t recall ever affirmatively following up with Sean, but she took the radio silence to
```
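When `json.loads` dies like this, the `pos` attribute of the `JSONDecodeError` points at the exact character that broke the parse, which makes a hidden control character easy to spot. A hypothetical snippet, not from the thread:

```python
import json

def show_parse_failure(raw: str, context: int = 10) -> str:
    """Parse raw JSON; on failure, return a repr of the text around the error."""
    try:
        json.loads(raw)
        return "parsed OK"
    except json.JSONDecodeError as e:
        window = raw[max(0, e.pos - context): e.pos + context]
        return f"failed at char {e.pos}: {window!r}"

# A string value interrupted by a stray control character (\x19 here),
# mimicking the unterminated-string failure above.
bad = '{"text": "took the radio\x19'
print(show_parse_failure(bad))
```

The `repr` makes the invisible byte visible (it prints as `\x19`), so you can see what the model actually emitted rather than what your editor renders.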
When I tried repeating it in the desktop client, to see what was going on, it said it was using Tesseract (I believe 5.1 does it visually, as I don’t see code if I trace), and here’s what it output…
```python
import pytesseract
from PIL import Image
import fitz  # PyMuPDF

# open pdf
pdf_path = "/mnt/data/DOJ-OGR-00004180-00004186-page-5.pdf"
doc = fitz.open(pdf_path)
page = doc.load_page(0)
pix = page.get_pixmap(dpi=300)
img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)
text = pytesseract.image_to_string(img)
text
```
‘Case 1:20-cr-00330-PAE Document 285-2 Filed 05/20/21 Page5of7\n\nSean to call AK back and let her know if the Florida agents had concerns. Sean never\ncalled AK back. AK doesn’t recall ever affirmatively following up with Sean, but she\ntook the radio silence to mean that the FBI agents in Florida did not express\ndissatisfaction.\n\nNo investigation was opened as a result of the February 29, 2016 meeting.\n\nNo investigation into Epstein was opened while AK was the Human Trafficking\nCoordinator and Project Safe Childhood coordinator.\n\nAK never met with any other Boies Schiller attorneys. AK does not recall ever speaking\nwith David Boies at any point.\n\nAK never met with the attorneys from the February 29, 2016 meeting again.\n\nWhen asked what, if any, communications AK had with the attorneys, AK noted that she\nsees an email from May 2016 with Stan Pottenger. AK believes it’s possible Stan called\nAK, but she does not remember that.\n\nAK does not recall any further conversations re Epstein in SDNY until much later when\nthe Miami Herald series was published.\n\nAK confirmed she has read the Daily News Article LP sent\n\nAK indicated that the article did not accurately describe AK’s interactions with attorneys\nfor Virginia Roberts.\n\nAK’s recollection is not that attorneys urged SDNY to open an investigation into “the\nduo.” They were focused on Epstein. Maxwell was mentioned in passing, not as a target.\nAK did not participate in a second meeting with anyone. Had there been a meeting on\nthis subject in the summer of 2016 in SDNY, AK would have known about it in her\ncapacity as Human Trafficking Coordinator.\n\no AK had been in that role since approximately 2010, and if someone came in to\nmeet with a unit chief about a potential human trafficking matter, AK would get\ncontacted about the meeting. AK met with chiefs to gather information about any\ncases they had that touched on human trafficking so that AK could coordinate. 
So\nif a second meeting had taken place with a supervisor in the office, AK would\nhave known about it. Dan Stein also would have brought AK in if he learned of a\nsecond meeting.\n\no AK would have made a record if there had ever been a second meeting. AK\nwould remember if there were a second meeting. She has no record and no\nmemory of any second meeting.\n\nAK does not recall ever speaking with or meeting David Boies in her life, so to her\nknowledge, Boies was not making any effort to persuade SDNY to investigate Epstein.\nAK noted it is possible Boies could have been on the phone if Stan Pottinger called AK\nin or around May of 2016, but AK has no recollection of such a call.\n\nAK does not recall the attorney presentation being framed on highlighting Maxwell’s\nassistance with Epstein’s sexual abuse. It was not presented as a “duo.” AK recalls the\nmention of a pilot and the mention of someone having an address book, so other people\nwere mentioned as part of telling the story of Epstein.\n\nSDNY_GM_02742890\n\nDOJ-OGR-00004184\n’

And when I asked about it, it said this:
It cut off for one simple reason:
Your OCR result was correct — my display was truncated, not the OCR.
When I responded earlier, the UI clipped the text visually after the word “radio”, because the raw OCR output contained a hidden or unprintable character immediately after that word. That character caused everything that followed to be suppressed when I displayed it inline, even though the OCR did capture the full page.
You can see the hidden character in your copy/paste here:
took the radio ￼

That glyph is the Unicode object replacement character (U+FFFC). It usually appears when:
- Tesseract sees a smudge or artifact it can’t interpret
- A ligature or combined character doesn’t map to UTF-8 cleanly
- There’s a rendering quirk in the environment displaying the text
Bottom line:
The OCR text itself is complete. My earlier displayed snippet was cut because the UI didn’t render the rest after encountering that character.
If you want, I can re-OCR with a stricter config to avoid replacement characters (e.g., using --oem 1 --psm 6, or filtering non-UTF-8).
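Independent of the `--oem`/`--psm` rerun suggested above, the replacement glyph can be stripped defensively before the OCR text goes anywhere near JSON. This is my own sketch, not the assistant’s suggestion:

```python
import re

# U+FFFC (object replacement character), U+FFFD (replacement character),
# and C0 control characters other than tab, newline, carriage return.
_JUNK_RE = re.compile(r"[\ufffc\ufffd\x00-\x08\x0b\x0c\x0e-\x1f]")

def clean_ocr_text(text: str) -> str:
    """Drop glyphs that break downstream JSON serialization or display."""
    return _JUNK_RE.sub("", text)

print(clean_ocr_text("took the radio\ufffc silence"))  # took the radio silence
```

Tab, newline, and carriage return are deliberately kept, since OCR output relies on them for layout.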
It doesn’t paste, but here’s an image of what shows up in one editor:

It went on to suggest how to sanitize for Unicode, but I couldn’t get the prompt to do that. I’m also still blind as to whether 5.1 is writing Tesseract code without showing it, or doing it visually, but I hit this in about 7% of OCRed docs.
Does this help? It causes a serious problem. However 5.1 is doing the OCR, it can capture redactions and other formatting tweaks that straight GCI and other services can’t, and for speed and ease it’s great; but not if we can’t sanitize upstream…
Your issue is not this forum topic’s. Discussed earlier is a prolific problem with AI that was trained on corpus or post-training data that was not properly encoded as Unicode, but instead as escaped bytes. You also say “doing it in the desktop client”, which tells me you are using ChatGPT (or are you trying to debug a completely different platform than your own?), while this is an API topic.
Where are you obtaining the JSON Decode Error from? Your own API response-handling code? The AI reporting what a Python tool produced?
Your issue is that you are using expensive AI repeatedly when you could have asked the AI to write a PDF OCR converter program for you.
Tesseract is pretty much the standard for programmatic open-source OCR. With the AI as the front end, you can try different deliverables as sandbox container file outputs, have the different available Python modules searched and employed, and form an idea of a script that can succeed on one particular document whose deterministic output contains a JSON-container-breaking character or invalid bytes instead of proper Unicode.
I’m getting a similar issue. I’m not really a dev but am a HEAVY user of the API (mostly prototyping). We are using gpt5 to produce structured output. As part of the process I json.dump() into storage, and while most encoded values look fine, a small share (less than 2%) of apostrophes were getting encoded as \u0019 instead of \u2019, for example. I handed my code over to another dev who integrated it with a .NET environment, and he gets malformed/corrupt characters on nearly every run. So there could be an environmental component at play here.
I inspected the raw response from gpt5 via json.loads() and it was indeed bad, \x19 specifically in one example.
I was going to add some sanitization anyway, as one should, but it would be great if it didn’t do this at all.
Should note, I don’t know why it helped really, but I updated to 5.2, added explicit reasoning/verbosity levels instead of defaulting, and converted to the Responses API vs Completions, and we seem to be getting better results consistently, at least for now.
Yeah… I had to do a few things:
For 5.2 completions (translations to English), I added the following to the developer prompt:
- The Unicode quotation mark, U+0022, MUST be used for ALL quoted text in the response.
- The Unicode apostrophe, U+02BC, MUST be used for ALL apostrophes in the response.
- The standard hyphen used in URLs must be the ASCII hyphen-minus (-).
- Do not use em-dashes in the response.
For the 5.2 Responses API, I added the following to the instructions prompt:
- The Unicode quotation mark, U+0022, MUST be used for ALL quoted text in the response.
- The standard hyphen used in URLs must be the ASCII hyphen-minus (-).
- Do not use em-dashes in the response.
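Since prompt instructions like these are not guaranteed, a cheap post-check can verify the response actually obeys them before it is stored. A hypothetical helper whose rules mirror the lists above:

```python
# Hypothetical post-check mirroring the prompt rules above:
# curly quotes and em-dashes are forbidden in the response text.
FORBIDDEN = {
    "\u2018": "left single quote",
    "\u2019": "right single quote",
    "\u201c": "left double quote",
    "\u201d": "right double quote",
    "\u2014": "em-dash",
}

def violations(text: str) -> list:
    """List (index, codepoint, description) for each forbidden character."""
    return [
        (i, f"U+{ord(c):04X}", FORBIDDEN[c])
        for i, c in enumerate(text)
        if c in FORBIDDEN
    ]

print(violations('He said "ok" - fine'))     # []
print(violations("He said \u201cok\u201d"))  # two double-quote violations
```

A non-empty result is a cheap signal to retry the request or route the text through sanitization rather than persisting it as-is.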
