Batch file results encoding irregularities

Hi everyone!

I’m having trouble with the output file from the batch API. Within a single file, one apostrophe character comes back as the Unicode escape “\u2019” with the “\” not escaped, while on a different line of the same file the same apostrophe comes back with the “\” escaped, giving me “\\u2019” in the string. I’m using .NET, and when I decode the byte array as UTF-8, the second example correctly converts to an apostrophe, while the first converts to a question mark.

Has anyone else run into this? Does OpenAI indiscriminately escape some characters and not others?

I should note, the original batch job is to a fine-tuned 4o model and meant to return a structured output based on the provided JSON schema in the request.

You might be talking about json in a json…in a json.

The reason it was inserted: that Unicode character is what the AI produced, as a native multi-byte UTF-8 sequence. Maybe a strange predicted apostrophe usage, maybe intentional. It is simply JSON-encoded for transmission.

I typed away at the Python console to give you an idea of what’s going on. The actual string in memory doesn’t contain the literal characters \u2019; it contains the Unicode character itself.

Some of that was operating on a Python dict, though. I’d better show what you need to replicate it:
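A minimal sketch with Python’s standard json module (the sample string is mine, standing in for whatever the model returned):

```python
import json

# The model produced the actual character U+2019 (right single quotation mark).
s = "before\u2019after"   # Python escape: the string in memory holds the real character
print(len(s))             # 12 characters; no backslash anywhere in memory

# json.dumps escapes non-ASCII by default (ensure_ascii=True), producing
# the \u2019 you see in the batch output file
encoded = json.dumps(s)
print(encoded)            # "before\u2019after"

# json.loads turns the escape back into the single character
decoded = json.loads(encoded)
print(decoded == s)       # True
```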

A frontier AI is going to produce better, faster advice than me about the facilities available for JSON parsing in .NET, so some free tokens of advice.

Proper Unicode JSON Handling in .NET

Understanding the Issue Clearly:

The issue you’ve encountered involves proper handling and parsing of Unicode escape sequences (\uNNNN) within JSON strings in a .NET context. The confusion typically arises from misunderstanding JSON encoding rules, especially with respect to escaping special characters and Unicode representations.

Here’s a quick recap of the behavior you’re seeing:

  • Your AI-generated output string contains a special Unicode apostrophe: ’ (Unicode U+2019).
  • When encoding this string as JSON, it’s entirely correct and standard for the apostrophe to be represented using the unicode escape: \u2019.
  • When double escaping occurs (\\u2019), you’re observing that the final decoded UTF-8 string in .NET is correct. But with single escaping (\u2019), it incorrectly shows a question mark (?), indicating an encoding/decoding error on the .NET side.

The Root Cause:

The core issue here is not that OpenAI indiscriminately escapes some characters or is inconsistent—it’s about understanding how JSON encoding and decoding work:

  • Standard JSON Encoding: JSON specification states that Unicode characters can (and often do) get represented in their escaped Unicode form \uNNNN. JSON encoding does NOT require double escaping of backslashes in properly encoded JSON strings. A valid JSON payload would indeed look like:
{
  "message": "before\u2019after"
}
  • Incorrectly Escaped JSON: If you see "\\u2019" within valid JSON, it literally denotes a backslash followed by u2019, not the Unicode character itself. JSON decoders interpret a double backslash (\\) as a single literal backslash (\) in the resulting string value. Thus, "\\u2019" decodes to the literal characters \u2019 rather than the actual Unicode apostrophe (’). This is typically a side effect of manually handling JSON, or of erroneously double-escaping strings before writing them to the JSON payload.

Therefore, the first example ("message": "before\u2019after") is the correct and desired encoding for JSON transmission.
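The difference between the two forms is easy to see at a console (a quick sketch in Python, whose json module follows the same decoding rules):

```python
import json

# Single-escaped: the JSON escape \u2019 decodes to the apostrophe character
print(json.loads('"before\\u2019after"'))    # before’after

# Double-escaped: \\ decodes to one literal backslash, so the result is the
# six characters \u2019 — not the apostrophe
print(json.loads('"before\\\\u2019after"'))  # before\u2019after
```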

Recommended JSON handling in .NET (proper solution):

In .NET, the correct JSON-handling strategy is straightforward and robustly supported through native libraries, particularly System.Text.Json (recommended in modern .NET) or the older but still popular Newtonsoft.Json.

Key practices to follow:

  • Do NOT manually perform escaping or encoding: Rely entirely on built-in JSON serializers (System.Text.Json.JsonSerializer or Newtonsoft.Json.JsonConvert).
  • Use UTF-8 Encoding consistently: JSON payloads must always be encoded/decoded using UTF-8 encoding explicitly, especially when dealing with web APIs.

Here’s exactly how you handle this using .NET’s built-in mechanisms:

Step-by-Step Structured Approach (.NET):

1. Using System.Text.Json (Recommended approach in modern .NET):

Serialization (C# example):

using System;
using System.Text.Json;
using System.Text;

class Program
{
    static void Main()
    {
        var obj = new { message = "before\u2019after" };

        // Serialize the object to UTF-8 JSON bytes. The default encoder
        // escapes non-ASCII characters (including U+2019) as \uNNNN, which
        // is exactly the \u2019 form seen in the output file. (Note:
        // UnsafeRelaxedJsonEscaping would emit the raw character instead.)
        var options = new JsonSerializerOptions
        {
            WriteIndented = true
        };

        byte[] jsonUtf8Bytes = JsonSerializer.SerializeToUtf8Bytes(obj, options);

        // Convert JSON UTF8 bytes to string for display or debugging
        string jsonString = Encoding.UTF8.GetString(jsonUtf8Bytes);

        Console.WriteLine("Serialized JSON:");
        Console.WriteLine(jsonString);
    }
}

This will output correct JSON representation:

{
  "message": "before\u2019after"
}

This JSON is correct and fully standards-compliant.


Deserialization (C# example):

When you consume JSON from API endpoints, directly deserialize from UTF-8 bytes:

using System;
using System.Text;
using System.Text.Json;

class Program
{
    static void Main()
    {
        // This is your input JSON (from API as byte array):
        string jsonFromApi = "{\"message\": \"before\\u2019after\"}";

        byte[] jsonUtf8Bytes = Encoding.UTF8.GetBytes(jsonFromApi);

        // Deserialize correctly from UTF8 bytes directly
        var obj = JsonSerializer.Deserialize<MyObj>(jsonUtf8Bytes);

        Console.WriteLine("Decoded message: " + obj.message);
    }

    class MyObj
    {
        public string message { get; set; }
    }
}

Output:

Decoded message: before’after

Notice:

  • The JSON input is written as "before\u2019after": the JSON string uses one backslash to denote a Unicode escape sequence.
  • After decoding by the library, the escape correctly resolves to the Unicode character (’).

2. Using Newtonsoft.Json (alternative):

Deserialization Example (C#):

using Newtonsoft.Json;
using System;

class Program
{
    static void Main()
    {
        string jsonFromApi = "{\"message\": \"before\\u2019after\"}";
        
        var obj = JsonConvert.DeserializeObject<MyObj>(jsonFromApi);
        
        Console.WriteLine("Decoded message: " + obj.message);
    }

    class MyObj
    {
        public string message { get; set; }
    }
}

Output:

Decoded message: before’after

Why the user’s decoding returned a ‘?’ (Question Mark):

This indicates a character encoding mismatch or improper decoding step:

  • Typically happens when attempting to manually decode byte arrays or strings using incorrect encoding settings (like ASCII or Latin-1) rather than UTF-8.
  • Always explicitly decode byte arrays as UTF-8 before deserializing to a string.

For example, always use:

string jsonString = Encoding.UTF8.GetString(byteArrayFromApi);

Decoding with the wrong encoding can silently replace Unicode characters with question marks (?).
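The substitution is easy to reproduce (a sketch in Python; the same thing happens when .NET decodes with Encoding.ASCII or a legacy code page):

```python
# U+2019 has no representation in ASCII, so lossy conversion substitutes '?'
s = "before\u2019after"
lossy = s.encode("ascii", errors="replace").decode("ascii")
print(lossy)             # before?after

# Decoded as UTF-8 end to end, the character survives
roundtrip = s.encode("utf-8").decode("utf-8")
print(roundtrip == s)    # True
```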


Final Recommendations (Summary):

  • Always encode and decode JSON payloads explicitly as UTF-8.
  • Trust your .NET JSON library (System.Text.Json or Newtonsoft.Json) to handle Unicode correctly without manual intervention.
  • Never manually escape or double-escape backslashes or unicode characters in JSON.
  • The OpenAI API’s behavior described is correct and consistent with standard JSON handling.

Following the above best-practice examples ensures robust, consistent handling of Unicode JSON payloads and resolves your described issue with incorrect character representations in .NET.


Thanks for the reply! I did read through the AI response you included, and can tell with certainty that it’s making incorrect assumptions. It took me a long time yesterday to get ChatGPT 4o to understand the issue. If I followed the suggestions from the AI response, ‘\\u2019’ would get output as ‘\u2019’. So I still need to implement custom logic to handle that case.

Where I’m at now is decoding the byte array normally, and then running this custom decoder over it before parsing it:

// Requires: using System.Text.RegularExpressions;
private string DecodeUnicodeEscapes(string input)
{
    // Collapse the double-escaped form first, so "\\u2019" becomes "\u2019".
    // Caution: this also rewrites any legitimate literal backslash-u sequence.
    string intermediate = input.Replace("\\\\u", "\\u");

    // Then decode every remaining 4-digit \uXXXX escape into its character
    string output = Regex.Replace(intermediate, @"\\u([0-9A-Fa-f]{4})", match =>
    {
        int unicodeValue = Convert.ToInt32(match.Groups[1].Value, 16);
        return ((char)unicodeValue).ToString();
    });

    return output;
}

I’m still testing, but it seems to be working.
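For what it’s worth, the same two-step logic sketched in Python (collapse the doubled backslash, then decode the \uXXXX escapes) handles both forms:

```python
import re

def decode_unicode_escapes(text: str) -> str:
    # Collapse the double-escaped form first, so \\u2019 becomes \u2019.
    # Caution: this also rewrites any legitimate literal backslash-u sequence.
    intermediate = text.replace("\\\\u", "\\u")
    # Then decode every 4-digit \uXXXX escape into its character
    return re.sub(r"\\u([0-9A-Fa-f]{4})",
                  lambda m: chr(int(m.group(1), 16)),
                  intermediate)

print(decode_unicode_escapes("a\\u2019b"))    # a’b  (single-escaped input)
print(decode_unicode_escapes("a\\\\u2019b"))  # a’b  (double-escaped input)
```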
