Json_object output format ruins unicode

claytongrassick · October 31, 2024, 3:29pm

The same prompt produces correct JSON in “text” mode and incorrect JSON in “json_object” mode.

Here’s the system prompt:

Translate the provided content into German. 
The content is provided as a JSON object containing an array "entries". 
Each object in the array contains "id", "context" and "html". 
Translate only the text within the "html" field, preserving all HTML tags and attributes.
Output as a JSON object containing an array "entries" of objects with the same format as the input. 
Output one entry for each input entry, in the same order as the input.

Here’s the user prompt:

{"entries": [{"html": "How would one decide on an ionization technique for acetaminophen or \u03b2-cyclodextrin?"}]}

In text mode, the output is correct:

{
  "entries": [
    {
      "html": "Wie würde man eine Ionisationstechnik für Acetaminophen oder \u03b2-Cyclodextrin auswählen?"
    }
  ]
}

But in json_object mode, it’s wrong:

{
  "entries": [
    {
      "html": "Wie würde man eine Ionisationstechnik für Acetaminophen oder \\\\u03b2-Cyclodextrin auswählen?"
    }
  ]
}

Note the quadruple backslashes, which are incorrect.

This occurs (differently, but still incorrectly) in both gpt-4o and gpt-4-turbo.

This has a significant impact: we can’t use json_object mode to constrain the output to JSON, so we have to use text mode and remove the triple backticks from the output.
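
For reference, the text-mode workaround can be sketched roughly like this in Python. The helper name `parse_fenced_json` and the sample string are illustrative assumptions, not real API output, and the pattern only handles a single fence wrapping the whole reply.

```python
import json
import re

# Three backticks, built programmatically so the example stays easy to read.
FENCE = "`" * 3

# Matches an optional language tag after the opening fence and captures the body.
FENCE_RE = re.compile(
    r"^\s*" + FENCE + r"[A-Za-z0-9_-]*[ \t]*\n(.*?)\n?[ \t]*" + FENCE + r"\s*$",
    re.DOTALL,
)


def parse_fenced_json(text: str):
    """Hypothetical helper: strip one surrounding code fence, then parse JSON."""
    match = FENCE_RE.match(text)
    payload = match.group(1) if match else text
    return json.loads(payload)


# Made-up sample of a text-mode reply wrapped in a fence.
raw = FENCE + 'json\n{"entries": [{"html": "\\u03b2-Cyclodextrin"}]}\n' + FENCE

result = parse_fenced_json(raw)
print(result["entries"][0]["html"])  # β-Cyclodextrin
```

Unfenced input is passed straight to `json.loads`, so the same helper should work whether or not the model adds the backticks.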

j.wischnat · October 31, 2024, 3:34pm

You could just sanitize the output, right?

Alternatively, try changing the hyperparameters like temperature to a lower value and see if that results in a more expected output.

Feel free to update with findings!

claytongrassick · October 31, 2024, 3:40pm

Temperature zero gives the same output. Also, unfortunately, it’s not just a matter of sanitization, as the output itself is incorrect: once parsed, it literally renders as “\u03b2-Cyclodextrin”, not “β-Cyclodextrin”.
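
To illustrate the difference, here is a minimal Python sketch. The two JSON strings are hand-written stand-ins for the text-mode and json_object-mode outputs, and `unescape_literal_unicode` is a hypothetical repair helper, not part of any library. It would also rewrite text that is genuinely meant to contain a literal `\u03b2`, and it does not handle surrogate pairs, so treat it as a stopgap at best.

```python
import json
import re

# Raw strings, so the backslashes below are exactly what the JSON parser sees.
good = r'{"html": "\u03b2-Cyclodextrin"}'   # one backslash: a JSON unicode escape
bad = r'{"html": "\\u03b2-Cyclodextrin"}'   # two backslashes: a literal backslash, then "u03b2"

print(json.loads(good)["html"])  # β-Cyclodextrin
print(json.loads(bad)["html"])   # \u03b2-Cyclodextrin

# One or more backslashes followed by "u" and four hex digits, left over after parsing.
LITERAL_ESCAPE = re.compile(r"\\+u([0-9a-fA-F]{4})")


def unescape_literal_unicode(text: str) -> str:
    """Hypothetical repair: turn leftover literal \\uXXXX sequences into characters."""
    return LITERAL_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


print(unescape_literal_unicode(json.loads(bad)["html"]))  # β-Cyclodextrin
```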

_j · October 31, 2024, 4:08pm

The characters are being escaped for the multiple nesting levels.

API response object, containing
messages.content object, containing
json object

The duplicated backslashes are duplicated again to preserve them.

When you actually “print” the parsed response content string, the \u03b2 escape is finally turned into the Unicode character:

{"response_to_user":"Here is the rendered unicode character for \\u03b2: β\n\nHere are some other similar Greek characters in their rendered unicode:\n- α (alpha) - \\u03b1\n- γ (gamma) - \\u03b3\n- δ (delta) - \\u03b4\n- ε (epsilon) - \\u03b5\n- θ (theta) - \\u03b8\n","response_topic":"Unicode characters"}

It is all about your code, which you don’t hint at.

Here’s a hint in hint format, mangled by the forum escaping.

{
"response_to_user": "To unpack and print the Unicode character \\u03b2 (which represents the Greek letter beta, \u03b2) from a JSON response in a programming language like Python, you can use thejson module to parse the JSON. Here\u2019s an example:\n\n```python\nimport json\n\n# Example JSON response containing the Unicode character\njson_response = '{\"letter\": \"\\u03b2\"}'\n\n# Parse the JSON response\ndata = json.loads(json_response)\n\n# Access the value and print it\nprint(data['letter']) # Output: \u03b2\n```\n\nIf you're using JavaScript, it would look like this:\n\n```javascript\n// Example JSON response\nconst jsonResponse = '{\"letter\": \"\\u03b2\"}';\n\n// Parse the JSON response\nconst data = JSON.parse(jsonResponse);\n\n// Access the value and print it\nconsole.log(data.letter); // Output: \u03b2\n```\n\nThis should correctly unpack the Unicode character and display it as intended.",
"response_topic": "Unicode in JSON"
}
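
The nesting levels described above can be reproduced with a small Python sketch. The response body here is a simplified mock built by hand (an assumption for illustration), not an actual API response; it only shows how each level of JSON encoding adds backslashes and how two parses remove them again.

```python
import json

# Innermost level: the JSON object the model is asked to produce.
inner = {"html": "β-Cyclodextrin"}

# Middle level: that object serialized into the message content string.
# json.dumps escapes non-ASCII by default, giving one backslash before u03b2.
content = json.dumps(inner)

# Outer level: a simplified mock of a response body carrying that string.
# The backslash is itself escaped here, giving two backslashes.
body = json.dumps({"choices": [{"message": {"content": content}}]})

print(content)     # one backslash before u03b2
print(body)        # two backslashes before u03b2
print(repr(body))  # four backslashes: repr() escapes each one again for display

# Unwrapping: parse the body, take the content string, then parse that string.
message_content = json.loads(body)["choices"][0]["message"]["content"]
parsed = json.loads(message_content)
print(parsed["html"])  # β-Cyclodextrin
```

If a debugger, log line, or UI shows the `repr` of the raw body, four backslashes are expected and harmless. They are only a problem if they survive both parses.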

claytongrassick · October 31, 2024, 4:30pm

The code I’m using is identical whether I specify json_object or text, so I don’t believe the issue is one of escaping. Also, for example, gpt-4-turbo outputs the following:

{
  "entries": [
    {
      "html": "Wie würde man eine Ionisationstechnik für Acetaminophen oder \beta-Cyclodextrin auswählen?"
    }
  ]
}

in json_object mode, which is also wrong.

claytongrassick · October 31, 2024, 4:32pm

Here’s a playground with this exact prompt:

https://platform.openai.com/playground/chat?preset=aqVUB8LwK03JphyyjwxO4tG9

_j · October 31, 2024, 5:01pm

It may be that, because you want HTML, the AI is also just being dumb about HTML and is overtrained on its own byte sequences.

See how it writes “response_string” as your name.

Tell Mr Chatbot that you want HTML numeric character references, like &#946; for β, in the HTML it writes, and see if that doesn’t avoid the whole escaping issue.

I still think it could be just a matter of unwrapping your JSONs correctly instead of casting them to other objects.
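
As a rough sketch of the numeric character reference idea in Python: `to_numeric_refs` is a hypothetical helper name, and whether the model reliably follows such an instruction is untested here. The point is only that the references are plain ASCII, so there are no backslash escapes to get mangled in JSON.

```python
import html


def to_numeric_refs(text: str) -> str:
    """Hypothetical helper: replace non-ASCII characters with &#NNN; references."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


encoded = to_numeric_refs("β-Cyclodextrin")
print(encoded)                 # &#946;-Cyclodextrin

# A browser renders the reference as β; html.unescape does the same in code.
print(html.unescape(encoded))  # β-Cyclodextrin
```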
