How to Sanitize illegal characters before sending to GPT? EDIT: Solved

I have encountered a problem with GPT where it returns a 400 error with the message “Bad request” if I include a specific character in the request. Below is the JSON that I send when making a request to gpt-3.5-turbo-16k. It is this specific character that makes it fail. When I put in other, normal characters or text, it does not fail.

{
  "model":"gpt-3.5-turbo-16k",
  "messages":[
    {
      "role":"user",
      "content":"ð"
    }
  ]
}

Is there a way to sanitize text before sending it to GPT to avoid this kind of failure?

The obvious answer is “don’t use that character,” but much of the incoming text comes from users. I think this might have something to do with character encoding, because the request fails when my Java application sends it from Windows machines, but not when it is running on a Mac. It’s very strange.

Any advice appreciated.

Out of curiosity, have you tried sending that to the Moderation endpoint?

It’s either a bug, or it’s used as some kind of marker.

It might be worth creating a bug entry for it. I had a quick search and found nothing for ASCII 208.

Just following up, but I found the issue. It’s specific to my implementation, but on the off chance someone else runs into something similar, I will include my solution.

The problem is that Java running on Windows does not use the same default encoding when converting strings to bytes that other platforms do. I updated the byte array conversion to specify UTF-8 explicitly, as below, and all problems vanished.

<MY_TEXT>.getBytes(StandardCharsets.UTF_8)
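
For anyone who wants to see the difference, here is a minimal sketch comparing an explicit UTF-8 conversion with a single-byte legacy encoding. The class name EncodingDemo and the helper toUtf8 are made up for illustration, and ISO-8859-1 only stands in for whatever the Windows default was, which is an assumption since the original encoding is not stated in this thread.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; not part of any library.
final class EncodingDemo {

    // Encode text with an explicit charset instead of relying on the platform default.
    static byte[] toUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String content = "\u00F0"; // the character from the failing request

        // Explicit UTF-8: two bytes (0xC3 0xB0), identical on every platform.
        byte[] utf8 = toUtf8(content);

        // Assumed stand-in for a single-byte Windows default: one byte (0xF0).
        // On its own, 0xF0 is not a valid UTF-8 sequence.
        byte[] legacy = content.getBytes(StandardCharsets.ISO_8859_1);

        System.out.println("Platform default: " + Charset.defaultCharset());
        System.out.println("UTF-8 bytes:  " + Arrays.toString(utf8));   // [-61, -80]
        System.out.println("Legacy bytes: " + Arrays.toString(legacy)); // [-16]
    }
}
```

If the receiving server expects UTF-8, the single legacy byte would plausibly be rejected as malformed, which would fit the 400 response. The same idea applies to readers and writers: pass the charset explicitly wherever the API accepts one.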


Could be a glitch token, but I’m not actually seeing any issue with it. I see OP has fixed it.

@Draque would you mind sharing what the original encoding was? And in the future, you might want to look into “input sanitization” (keywords). That’s a fairly well solved problem that has been plaguing database and web admins for ages now. You’ve got the first step, converting to a standard encoding, but if you start seeing more or similar issues you might want to create a blacklist and replace the offending characters with <?> or something.
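
A rough sketch of that blacklist idea, in case it helps. The class InputSanitizer, its method names, and the choice of what to block (unpaired surrogates and control characters other than tab, newline, and carriage return) are all assumptions for illustration, not a list of characters the API is known to reject.

```java
// Hypothetical helper; the blocked set below is an illustrative assumption.
final class InputSanitizer {

    static final String REPLACEMENT = "<?>";

    // Walk the string by code point and replace anything on the blacklist.
    static String sanitize(String input) {
        StringBuilder out = new StringBuilder(input.length());
        int i = 0;
        while (i < input.length()) {
            int cp = input.codePointAt(i);
            i += Character.charCount(cp);
            if (isBlocked(cp)) {
                out.append(REPLACEMENT);
            } else {
                out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }

    static boolean isBlocked(int cp) {
        // codePointAt returns the raw surrogate value when a surrogate is unpaired;
        // such a value cannot be encoded as well-formed UTF-8.
        if (cp >= 0xD800 && cp <= 0xDFFF) {
            return true;
        }
        // Block control characters, but keep common whitespace.
        if (Character.isISOControl(cp)) {
            return cp != '\t' && cp != '\n' && cp != '\r';
        }
        return false;
    }
}
```

For example, InputSanitizer.sanitize("a\u0000b") gives "a<?>b", while ordinary non-ASCII text such as ð passes through unchanged. Running this before the UTF-8 conversion keeps the two concerns separate: the sanitizer decides which characters are allowed, and the encoder decides how they become bytes.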