I am experiencing similar issues with mixed Korean/English output.
In the output below, ' g c g e e c h e w a s n e w .' is supposed to be Korean text.
The student translated ' g c g e e c h e w a s n e w .' as
'They shared their experience.'
Originally, my prompt surrounded user input in <<brackets like this>>, which caused significantly worse output.
JSON generation: when JSON is generated, certain characters are escaped to ensure the resulting string is valid JSON. For example, the < character might be escaped as \u003c and % as \u0025. This is standard behavior in many JSON libraries, partly to prevent potential security issues like XSS attacks.
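To illustrate: a JSON parser treats \u003c as just another spelling of '<', so if the model's output is a valid JSON string literal, round-tripping it through `json.loads` undoes the escaping. A minimal sketch in Python (the SQL fragment is an invented example):

```python
import json

# The model emitted a JSON string containing \u003c instead of '<'.
# In Python source, '\\u003c' is the six literal characters \u003c.
escaped = '"SELECT * FROM t WHERE a \\u003c b"'

# json.loads decodes the \uXXXX escape back into the real character.
print(json.loads(escaped))  # SELECT * FROM t WHERE a < b
```

This only works when the text really is a JSON string literal (including the surrounding quotes); for bare text with stray escapes, see the `codecs` approach mentioned below.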
I don’t believe so, since the behavior is nondeterministic (it only happens sometimes, with certain SQL queries), which is why I suspect it might be a quirk of the model.
You can run the strings through Python's codecs.decode (or similar), which will attempt to decode the escape sequences. The model may simply "want" to write escapes. You can try to prevent that by demoting the \ and \u tokens via the logit_bias parameter (after using a tokenizer to look up their cl100k_base token IDs), or by adding an instruction to the prompt saying such escapes should be avoided in the output.
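A minimal sketch of the codecs.decode approach (the SQL string is an invented example of model output containing a literal \u003c sequence):

```python
import codecs

# Model output arriving with a literal backslash-u escape in it.
raw = "SELECT * FROM logs WHERE size \\u003c 100"  # contains \u003c, not '<'

# The unicode_escape codec interprets \uXXXX sequences in the string.
decoded = codecs.decode(raw, "unicode_escape")
print(decoded)  # SELECT * FROM logs WHERE size < 100
```

One caveat: unicode_escape assumes Latin-1 for the non-escape characters, so it can corrupt genuine non-ASCII text such as Korean. For mixed-language output it is safer to apply the decode only to strings that actually contain a \u escape, rather than to everything.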