How to Sanitize illegal characters before sending to GPT? EDIT: Solved

Draque · August 5, 2023, 10:43pm

I have encountered a problem with GPT where the API returns a 400 error with the message “Bad request” if I include a specific character in the request. Below is the JSON that I am sending to gpt-3.5-turbo-16k. It is this specific character that makes it fail; with other, normal characters or text, the request succeeds.

{
  "model":"gpt-3.5-turbo-16k",
  "messages":[
    {
      "role":"user",
      "content":"ð"
    }
  ]
}

Is there a way to sanitize text before sending it to GPT to avoid this kind of failure?

The obvious answer is “don’t use that character,” but much of the incoming text is from users. I think this might have something to do with character encoding, as when my Java application sends this, it will fail on Windows machines, but not when it is running on a Mac. It’s very strange.

Any advice appreciated.

Foxalabs · August 5, 2023, 11:18pm

Out of curiosity, have you tried sending that to the Moderation endpoint?

It’s either a bug, or it’s used as some kind of marker.

Might be worth creating a bug entry for it. I had a quick search and found nothing for ascii 208.

Draque · August 5, 2023, 11:37pm

Just following up, but I found the issue. It’s specific to my implementation, but on the off chance someone else runs into something similar, I will include my solution.

The problem is that Java running on Windows does not use the same default charset as other platforms when converting strings to bytes. I updated the byte array conversion to the below, and all problems vanished.

<MY_TEXT>.getBytes(StandardCharsets.UTF_8)
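To illustrate why the explicit charset matters, here is a minimal sketch. It assumes the Windows default was a single-byte charset such as windows-1252 (the thread does not say which one); ISO-8859-1 stands in for it because it encodes ð to the same single byte. The class and method names are hypothetical. A lone 0xF0 byte is not valid UTF-8, which would be consistent with the server rejecting the body, though that is a guess rather than confirmed API behavior.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class EncodingDemo {
    // Encode explicitly as UTF-8, independent of the platform default charset.
    static byte[] toUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    // Stand-in for a no-argument getBytes() on a JVM whose default charset
    // is a single-byte Western encoding (assumption, see lead-in).
    static byte[] toLegacy(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    // Returns true only if the bytes form a well-formed UTF-8 sequence.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String content = "\u00F0"; // the character ð from the failing request

        System.out.println(Arrays.toString(toUtf8(content)));   // [-61, -80]
        System.out.println(Arrays.toString(toLegacy(content))); // [-16]

        System.out.println(isValidUtf8(toUtf8(content)));   // true
        System.out.println(isValidUtf8(toLegacy(content))); // false
    }
}
```

Plain ASCII text encodes to the same bytes under both charsets, which would explain why only certain characters triggered the failure.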

chrstfer · August 6, 2023, 12:04am

Could be a glitch token, but I’m not actually seeing any issue with it. I see OP has fixed it.

@Draque would you mind sharing what the original encoding was? And in the future, you might want to look into “input sanitization” (as search keywords). That’s a fairly well solved problem that has been plaguing database and web admins for ages now. You’ve got the first step, converting to a standard encoding, but if you start seeing more or similar issues you might want to create a blacklist and replace the offending characters with <?> or something.
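A minimal sketch of the blacklist-and-replace idea might look like the following. The choice of what to block (control characters other than tab, newline, and carriage return, plus unpaired surrogates) is an assumption made for illustration; nothing in this thread establishes that the API rejects any of these. The class name, method name, and placeholder constant are hypothetical.

```java
final class InputSanitizer {
    // Placeholder substituted for each blocked character (hypothetical choice).
    static final String PLACEHOLDER = "<?>";

    // Replaces control characters (except \n, \r, \t) and unpaired
    // surrogates with PLACEHOLDER; everything else passes through.
    static String sanitize(String input) {
        StringBuilder out = new StringBuilder(input.length());
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);

            // Keep well-formed surrogate pairs (e.g. emoji) intact.
            if (Character.isHighSurrogate(c)
                    && i + 1 < input.length()
                    && Character.isLowSurrogate(input.charAt(i + 1))) {
                out.append(c).append(input.charAt(i + 1));
                i += 2;
                continue;
            }

            boolean allowedWhitespace = c == '\n' || c == '\r' || c == '\t';
            boolean blocked = Character.isSurrogate(c)
                    || (Character.isISOControl(c) && !allowedWhitespace);

            if (blocked) {
                out.append(PLACEHOLDER);
            } else {
                out.append(c);
            }
            i++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(sanitize("a\u0000b"));  // a<?>b
        System.out.println(sanitize("x\uD83Dy"));  // x<?>y
    }
}
```

Legitimate non-ASCII text such as ð passes through unchanged, so a filter like this complements the explicit UTF-8 encoding rather than replacing it.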
