GPT API Failed to create completion as the model generated invalid Unicode output

I got a very strange OpenAI error message that I cannot find anywhere on the internet. Here is what I got when I tried to call the API:

Something went wrong when call GPT. Detail Error(code=invalid_model_output, message=Failed to create completion as the model generated invalid Unicode output. Unfortunately, this can happen in rare situations. Consider reviewing your prompt or reducing the temperature of your request. You can retry your request, or contact us through our help center at help.openai.com if the error persists. (Please include the request ID req_a90b4e95a18fcab8e3d188e5d11b919f in your message.

Could any developers help me get rid of this?

The OpenAI tokenizer produces bytes, so the model has some freedom to produce novel UTF-8 character encodings, which range from one to four bytes in length. There are about a million valid Unicode code points, and over 100k assigned characters.

That also means that there are about 16 million invalid byte sequences once characters beyond the one-byte (ASCII) range come into play.
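To illustrate what an undecodable sequence looks like, here is a minimal Python sketch. The sample character and the two broken variants are chosen purely for illustration and are not taken from the failing request; `try_decode` is a hypothetical helper name.

```python
from typing import Optional

# A valid three-byte UTF-8 character, then two deliberately broken variants.
valid = "ệ".encode("utf-8")                    # three bytes: lead byte + two continuation bytes
truncated = valid[:-1]                         # lead byte promises three bytes, only two present
bad_continuation = valid[:1] + b"A" + valid[2:]  # ASCII "A" is not a continuation byte (10xxxxxx)


def try_decode(data: bytes) -> Optional[str]:
    """Return the decoded text, or None if the bytes are not valid UTF-8."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


for label, data in [("valid", valid), ("truncated", truncated), ("bad continuation", bad_continuation)]:
    print(label, data.hex(" "), "->", try_decode(data))
```

Only the first sequence decodes; the other two are the kind of output a strict UTF-8 decoder has to reject.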

The AI output something that couldn’t be decoded. This is a model problem, a problem with the inputs, or, as the error message suggests, the random chance of an unlikely and invalid output being predicted and sampled.

If you can reproduce the error deterministically at top_p=0.0001, that would be even more remarkable than a random happenstance output.
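Since the error message itself suggests retrying or reducing the temperature, a retry wrapper could look roughly like the sketch below. Everything here is an assumption for illustration: `InvalidModelOutput` is a hypothetical stand-in for however your client library surfaces `code=invalid_model_output`, and `request` is whatever function you already use to call the API, taking the temperature as its only argument.

```python
import time
from typing import Callable, Optional


class InvalidModelOutput(Exception):
    """Hypothetical stand-in for an error with code=invalid_model_output."""


def call_with_retry(
    request: Callable[[float], str],
    temperature: float = 1.0,
    attempts: int = 3,
    backoff_s: float = 0.0,
) -> str:
    """Call request(temperature), halving the temperature after each invalid-output error."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error: Optional[Exception] = None
    for _ in range(attempts):
        try:
            return request(temperature)
        except InvalidModelOutput as err:
            last_error = err
            temperature /= 2       # lower the temperature for the next attempt
            time.sleep(backoff_s)  # optional pause between attempts
    assert last_error is not None
    raise last_error
```

If the error still appears after the temperature has been lowered several times, that points away from random sampling and toward the prompt or the model itself.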

I would guess that this is most likely when using gpt-4-1106-preview or vision, as those models were trained with bad Unicode when employing functions.

Oh, I see. The weird thing is that sometimes it reports an error and sometimes it does not.
I ask GPT to extract the body content from a Markdown document. The content is in Vietnamese. Do you have any suggestions for detecting where it goes wrong?

Update 1:

  • I found a pretty fun workaround: I changed the model from gpt-3.5-turbo to gpt-3.5-turbo-16k, and it seems to be working well.

Since gpt-3.5-turbo-1106, I have also been experiencing a similar Unicode problem when parsing Japanese strings containing kanji. In some cases, the second and subsequent bytes of UTF-8 characters are unintentionally altered when the prompt is longer. For example, where the character 田 (e7 94 b0) is expected, 町 (e7 94 ba) is returned, resulting in strange Japanese. This happened with gpt-3.5-turbo-0125 and gpt-3.5-turbo-1106, but not with gpt-3.5-turbo-16k.
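The byte values quoted above can be checked with a few lines of Python. This is only a sketch for comparing an expected character with the one that came back; `utf8_hex` and `differing_byte_positions` are hypothetical helper names.

```python
from typing import List


def utf8_hex(text: str) -> str:
    """Return the UTF-8 encoding of text as space-separated hex bytes."""
    return text.encode("utf-8").hex(" ")


def differing_byte_positions(expected: str, actual: str) -> List[int]:
    """Return the byte offsets where the two UTF-8 encodings differ.

    Only the common prefix length is compared; a length mismatch is not reported.
    """
    a, b = expected.encode("utf-8"), actual.encode("utf-8")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


print(utf8_hex("田"))                        # e7 94 b0
print(utf8_hex("町"))                        # e7 94 ba
print(differing_byte_positions("田", "町"))  # [2]
```

The two characters share their first two bytes and differ only in the final continuation byte, so the substitution still yields valid UTF-8 and would not raise a decoding error.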

In my case, including the following statement in the prompt has alleviated this phenomenon, but has not eliminated it completely.

[placeholder] are written in Japanese, and byte sequence of each UTF-8 multibyte character must be retained and not modified.

It is a good thing that I don’t have to detect the error myself, since the API has very recently started returning error responses instead of wrong characters.
