Mangled en-dashes and em-dashes received via API

Hi there,

I am receiving GPT-4 responses (gpt-4-0314) via a stream that I handle with Volley in an Android app. The response string occasionally contains a mangled triplet of characters that is clearly meant to encode a dash (an en-dash or an em-dash). The error occurs while I am still handling the response as a raw string, with no attempt on my part to enforce any encoding. Later I save the stream as a UTF-8 text file and find the triplet on reading the file back, but I can detect it within the response from GPT before I even decode the JSON object.

My understanding is that Volley uses UTF-8 by default, so this should not happen within the Volley code.

My workaround for now is to replace the offending triplets with a space-hyphen-space combination, but it seems very cumbersome. While it is possible I have made an error (I am new to Kotlin), this seems to be coming from the server, as I pick it up directly in my response listener.
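For reference, a sketch of that replacement workaround (names are illustrative, not my exact code): since the real dashes are U+2013 and U+2014, mapping each triplet back to the dash it was meant to encode loses less information than space-hyphen-space.

```kotlin
// Illustrative sketch: map each mangled triplet back to the dash it
// was meant to encode.
fun fixDashes(s: String): String = s
    .replace("\u00E2\u0080\u0093", "\u2013") // mangled triplet -> en-dash
    .replace("\u00E2\u0080\u0094", "\u2014") // mangled triplet -> em-dash

fun main() {
    println(fixDashes("a\u00E2\u0080\u0094b")) // prints "a—b"
}
```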

The bad triplet strings are as follows:
```kotlin
val badString1 = "\u00E2\u0080\u0093" // mangled en-dash
val badString2 = "\u00E2\u0080\u0094" // mangled em-dash
```
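For what it's worth, those triplets are exactly what you get when the UTF-8 bytes of an en-dash (U+2013) or em-dash (U+2014) are decoded with the wrong charset (Latin-1 / ISO-8859-1). A few lines of plain Kotlin demonstrate it:

```kotlin
fun main() {
    val enDash = "\u2013"                               // the real en-dash
    val utf8Bytes = enDash.toByteArray(Charsets.UTF_8)  // bytes E2 80 93
    // Decoding those bytes as Latin-1 yields the mangled triplet
    val mangled = String(utf8Bytes, Charsets.ISO_8859_1)
    println(mangled == "\u00E2\u0080\u0093")            // prints "true"
}
```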

An example of an affected response is:

```
data: {"id":"chatcmpl-…","object":"chat.completion.chunk","created":1691284934,"model":"gpt-4-0314","choices":[{"index":0,"delta":{"content":"—"},"finish_reason":null}]}
```

If this is a known error with Volley or GPT, then I'll stick with my workaround until it is fixed. If no one else has seen it and other developers are getting correctly encoded en-dashes and em-dashes, then please let me know what setup you are using.

If OpenAI would like a self-contained bug reproducer, I can work on one, but I might not have time for a little while. I have not checked whether other GPT4 models have the same issue.

On posting this, the 'content' field dropped two unprintable characters, keeping only 'â'; the content was originally a three-character string, corresponding to one of the badString values.

If they are present in all messages, then it's a bug in whatever handling code is calling the API endpoints: you should sanitise the returned data into UTF-8, and perhaps keep a blocklist of unwanted text. The model can return things like this if they appear in the training data often enough.
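If the corruption really is UTF-8 bytes decoded as Latin-1, one generic way to sanitise (rather than listing triplets one by one) is to reverse the bad decode. This is a sketch, not a guaranteed fix: it is only safe when every character in the string is below U+0100, since characters above that cannot be re-encoded as Latin-1.

```kotlin
// Sketch: reverse a "UTF-8 decoded as Latin-1" mistake by re-encoding the
// string as Latin-1 and decoding the resulting bytes as UTF-8. Only valid
// when the mangled string contains no characters above U+00FF.
fun repairMojibake(s: String): String =
    String(s.toByteArray(Charsets.ISO_8859_1), Charsets.UTF_8)

fun main() {
    println(repairMojibake("caf\u00C3\u00A9"))                // prints "café"
    println(repairMojibake("\u00E2\u0080\u0094") == "\u2014") // prints "true"
}
```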

Also worth checking that the temperature value is well below 2, preferably 1 or below.

Thanks… I’ll look into that. For now, there are only two bad strings I have to check for, so it is not a big issue.

I suppose I had been assuming the error was on the dumb-code side of things (including my own code), but I guess it could be genuine output from a flawed AI, with no bug as such. It makes sense that there might be a lot of mis-encoded chunks of text in the training data, just as there is probably bad HTML, bad code examples, bad grammar and badly written fiction, etc. I had hoped that simple encoding errors would have been cleaned up prior to training, but I guess that cleaning such a massive dataset is an impossible task.

Still, this would be an easy thing to catch and fix on the server side, to save all the API users from having to catch it - so if that is the answer, maybe OpenAI could patch it?

Let's not jump the gun on this. It could be an issue in the training data, but if it were, this sequence is so common that failing to include it would itself be the error, much like the newline \n sequence at the end of a paragraph. So the likelihood is that the code handling the endpoint is incorrectly dealing with something like a block-size boundary (a common bug in streaming handlers).
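For completeness, here is a sketch of a stream-safe decoder that avoids the chunk-boundary failure mode (all names hypothetical). One caveat: splitting a multi-byte character across chunks and decoding each chunk as UTF-8 produces U+FFFD replacement characters rather than these particular triplets, so a chunk-boundary bug alone would not explain them; the triplets point at a wrong-charset decode. Still, a streaming handler should hold back incomplete sequences either way:

```kotlin
// Sketch of a stream-safe UTF-8 decoder (hypothetical helper): hold back
// any incomplete multi-byte sequence at the end of a chunk and prepend it
// to the next chunk, so a character is never split across decode calls.
class Utf8ChunkDecoder {
    private var carry = ByteArray(0)

    fun feed(chunk: ByteArray): String {
        val bytes = carry + chunk
        var cut = bytes.size
        // Walk back over trailing continuation bytes (10xxxxxx), at most 3
        var i = bytes.size - 1
        while (i >= 0 && i >= bytes.size - 3 && (bytes[i].toInt() and 0xC0) == 0x80) i--
        if (i >= 0) {
            val lead = bytes[i].toInt() and 0xFF
            val needed = when {              // total bytes this lead byte requires
                lead >= 0xF0 -> 4
                lead >= 0xE0 -> 3
                lead >= 0xC0 -> 2
                else -> 1
            }
            if (bytes.size - i < needed) cut = i  // incomplete: hold it back
        }
        carry = bytes.copyOfRange(cut, bytes.size)
        return String(bytes, 0, cut, Charsets.UTF_8)
    }
}

fun main() {
    val dec = Utf8ChunkDecoder()
    val bytes = "a\u2014b".toByteArray(Charsets.UTF_8)  // 61 E2 80 94 62
    // Deliver the bytes in two chunks that split the em-dash mid-sequence
    val out = dec.feed(bytes.copyOfRange(0, 2)) +
              dec.feed(bytes.copyOfRange(2, bytes.size))
    println(out == "a\u2014b")  // prints "true"
}
```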

Also what is the question you are asking that gets the reply of â ?

The â is simply one character of a three-character string that is clearly intended to be an en-dash or em-dash; the other two characters are as shown in badString1 and badString2. From context, it is clear that GPT is simply telling a story and puts in the dashes appropriately as ordinary punctuation marks, but they arrive in the stream as a three-character string, indicating faulty encoding somewhere along the line. This happens before I do anything at all with the returned result.

There is a separate Unicode character for the en-dash (U+2013) and the em-dash (U+2014), so that's what GPT should be using when a dash is wanted; changing this three-character sequence on the server side (if that's where the issue is) will not prevent GPT from outputting anything it should be outputting. There can't be many genuinely useful uses of those two three-character strings (except for discussions of bugs like this, where there are alternatives such as the Unicode escape sequences I used).

So, the question was something like: tell me a story. For weeks I have not seen faulty characters, but I recently switched to this particular GPT-4 model because of completely different (semantic-level) errors with other models.

It could well be a Volley error, as that is the library I am using. But that seems weird, as UTF-8 is the default encoding.
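One thing worth double-checking on the Volley side: a request parses its charset from the response's Content-Type header via HttpHeaderParser.parseCharset, and if the server omits the charset parameter, the older single-argument overload falls back to ISO-8859-1 (the historical HTTP default), which would produce exactly these triplets. A sketch of forcing UTF-8 regardless of the header (this uses Volley's public StringRequest / NetworkResponse API, but verify the signatures against the Volley version you depend on):

```kotlin
import com.android.volley.NetworkResponse
import com.android.volley.Response
import com.android.volley.toolbox.HttpHeaderParser
import com.android.volley.toolbox.StringRequest

// Sketch: decode the body as UTF-8 regardless of the Content-Type header,
// instead of letting parseCharset fall back to ISO-8859-1.
class Utf8StringRequest(
    method: Int,
    url: String,
    listener: Response.Listener<String>,
    errorListener: Response.ErrorListener
) : StringRequest(method, url, listener, errorListener) {

    override fun parseNetworkResponse(response: NetworkResponse): Response<String> {
        val parsed = String(response.data, Charsets.UTF_8)
        return Response.success(parsed, HttpHeaderParser.parseCacheHeaders(response))
    }
}
```

If the triplets disappear with a request like this, the bug was the charset fallback rather than anything on the server.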

Somewhat related to this, I guess: