Tokens are mangled for some non-English characters [resolved]

I’m working on a research project for which we are using the OpenAI API with GPT-3. The query (in Python) looks as follows:

response = openai.Completion.create(
    model="text-davinci-003",
    prompt=prompt,
    max_tokens=0,
    n=1,
    logprobs=1,
    echo=True,
)

For English sentences, everything works as expected, but for sentences in German or Japanese, while response.choices[0].text looks fine, response.choices[0].logprobs.tokens is a mess, with many non-English characters replaced by cryptic bytes:\xXX sequences. For example:

prompt : "Übliche Öfen mögen Äpfel übermäßig." (notice the umlauts)
tokens: ['bytes:\\xc3', 'bytes:\\x9c', 'b', 'lic', 'he', ' Ö', 'fen', ' m', 'ö', 'gen', 'bytes: \\xc3', 'bytes:\\x84', 'p', 'f', 'el', 'bytes: \\xc3', 'bytes:\\xbc', 'ber', 'm', 'ä', 'ß', 'ig', '.']

While more chopped up than an English sentence would be, most tokens are fine, but some of the umlauts are replaced by pairs of tokens of the form bytes:\xXX.

For a Japanese sentence, it’s even worse, with most tokens being the mangled bytes:\xXX form.

And these aren’t even the Unicode code points for the characters. EDIT: They do make sense after all; they are simply UTF-8 bytes, see my response below.

Does anybody have an idea what exactly is happening? And is there some way to recover the readable tokens?


I don’t know German or Japanese, so you will have to verify this idea.

While the individual bytes are not the correct Unicode characters, together they seem to be, i.e.

'bytes: \\xc3', 'bytes:\\xbc'

the byte pair C3 BC, which a Unicode lookup shows is the character ü.

In the same way, the three tokens ' m', 'ö', 'gen' combine to make a space and the word mögen, and the byte tokens combine to make the Unicode hex value for a Japanese character.
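
A quick sanity check in Python, assuming the bytes are UTF-8:

b'\xc3\xbc'.decode('utf-8')  # 'ü'
' m' + 'ö' + 'gen'           # ' mögen'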

HTH


I would like to know too. I was doing some completion for Chinese text and received the tokens ['bytes: \\xe8', 'bytes:\\xab', 'bytes:\\x8b', 'bytes:\\xe5', 'bytes:\\x95'], but response.choices[0]['text'] gives “請”… I couldn’t figure out how the tokens could be converted to 請.


Converting a sequence of Unicode bytes into a displayable character is confusing at first because of the use of variable-width encodings.

To quickly figure out the Unicode bytes for 請 there are online sites such as https://onlineunicodetools.com/ with pages for specific conversions, e.g., https://onlineunicodetools.com/convert-unicode-to-bytes. Using that page, 請 is 0xe8 0xab 0x8b, which are the first three bytes you noted.
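
The same lookup can also be done directly in Python, e.g.:

'請'.encode('utf-8')  # b'\xe8\xab\x8b'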

So the next question is how the page did the conversion, which is also similar to your question. I used to know this years ago but can’t remember whether you need to know the Unicode encoding (such as UTF-8, UTF-16, UCS-2, UTF-32, or UCS-4) ahead of time, or whether it can be determined by looking at the leading bits of the first byte. Once you know the encoding, the rest is fairly routine, as many programming languages have a Unicode library with conversion routines, e.g., the Unicode HOWTO in the Python 3.11.2 documentation. One point that helps: there are several Unicode encodings, but a few are so common that they should be tried first, e.g., UTF-8, UTF-16, and, on Windows, also UCS-2.
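
For UTF-8 at least, the leading bits of the first byte do tell you how many bytes make up a character; here is a minimal sketch (the helper name utf8_sequence_length is just for illustration):

def utf8_sequence_length(first_byte: int) -> int:
    # UTF-8 leading-byte patterns:
    # 0xxxxxxx -> 1 byte, 110xxxxx -> 2 bytes, 1110xxxx -> 3 bytes, 11110xxx -> 4 bytes
    if first_byte < 0x80:
        return 1
    if first_byte >> 5 == 0b110:
        return 2
    if first_byte >> 4 == 0b1110:
        return 3
    if first_byte >> 3 == 0b11110:
        return 4
    raise ValueError("not a valid UTF-8 leading byte")

utf8_sequence_length(0xe8)  # 3, so 0xe8 0xab 0x8b is one three-byte character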

If there is metadata, such as on a web page or in the response of a method, it might indicate the Unicode encoding. If an app is used, check its documentation, as it may note that a specific encoding is used.

Since the result you are getting is in bytes, I would try UTF-8 first. Usually, if a wider encoding is used, programming libraries tend to pass the values in units of that encoding’s width.
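
A small sketch of that “try the common encodings first” idea, using the bytes from the Chinese example above:

raw = b'\xe8\xab\x8b'
for enc in ('utf-8', 'utf-16', 'utf-32'):
    try:
        print(enc, '->', raw.decode(enc))
    except UnicodeDecodeError as err:
        print(enc, '-> failed:', err)
# Only 'utf-8' succeeds here, printing 請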

HTH

Okay, I made a silly assumption, that being that the Python functions chr and ord work with UTF-8. They don’t; as far as I can tell, they work with Unicode code points.

In any case, it turns out that e.g. c3 9c is the correct UTF-8 byte sequence for Ü after all:

b'\xc3\x9c'.decode('utf-8')  # 'Ü'

And @jimmychui, your tokens seem to be just UTF-8 as well:

b'\xe8\xab\x8b'.decode('utf-8')  # '請'

Though the last two bytes in the array you gave seem to be missing a third one to form a correct UTF-8 sequence (I presume it’s 8f, for '請問' ^^)
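
If that presumption is right, the completed three-byte sequence decodes cleanly:

b'\xe5\x95\x8f'.decode('utf-8')  # '問' (assuming the missing byte really is 0x8f)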


One question remains, though: is there a neat way to get the actual UTF-8-encoded characters back out of the tokens?

The best I could come up with is this ugliness:

# With t being for example 'bytes:\\xc3'
eval("b'" + t[6:] + "'")

That way I get a bytes object which I then can concatenate with further tokens, before finally decoding the assembled UTF8 codepoint with _.decode('utf-8').

So for all tokens:

tokens = [eval("b'" + t[6:] + "'")
          if t.startswith("bytes:")
          else t.encode('utf-8')
          for t in tokens]
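
After that, the last step described above is just joining and decoding, roughly:

text = b''.join(tokens).decode('utf-8')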

The canonical way to unescape in Python, with _.decode('unicode-escape'), does not work with escaped non-ASCII UTF-8 bytes. No clue why at first, though it appears to be the intended behavior: the codec turns a \xNN escape into the code point U+00NN, not into the raw byte 0xNN.
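
That behavior also points to an eval-free workaround: unescape first, then re-encode the resulting U+00NN code points with Latin-1, which maps each of them back to a single byte. A sketch, assuming the token has no stray space after 'bytes:':

# t is for example 'bytes:\\xc3'
raw = t[6:].encode('latin-1').decode('unicode_escape').encode('latin-1')
# raw == b'\xc3'

Alternatively, ast.literal_eval("b'" + t[6:] + "'") is a safer drop-in for eval.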

Here’s a way to recover the string which works for your example data:

# I edited your token list by removing extra spaces (e.g., from "bytes: \\xc3")
# See below for what to do if those spaces really are there
tokens = [
    'bytes:\\xc3', 'bytes:\\x9c', 'b', 'lic', 'he', ' Ö', 'fen', ' m',
    'ö', 'gen', 'bytes:\\xc3', 'bytes:\\x84', 'p', 'f', 'el',
    'bytes:\\xc3', 'bytes:\\xbc', 'ber', 'm', 'ä', 'ß', 'ig', '.'
]
byte_tokens = []
for t in tokens:
    if t.startswith('bytes:\\x'):
        char_int = int(t[8:], base=16)
        byte_tokens.append(bytes([char_int]))
    else:
        byte_tokens.append(t.encode())
original = b''.join(byte_tokens).decode()
print(original)
# Übliche Öfen mögenÄpfelübermäßig.

Either you missed some data when copying the tokens over, or the tokens are missing spaces before words that start with a non-Latin character.

If some of the tokens really do have a space between “bytes:” and the hex digits, then change this line:

char_int = int(t[8:], base=16)

by replacing it with:

char_int = int(t.split('x', maxsplit=1)[1], base=16)
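
A quick check that this variant handles both spellings of the byte tokens:

for t in ['bytes:\\xc3', 'bytes: \\x84']:
    print(int(t.split('x', maxsplit=1)[1], base=16))
# 195 (0xc3)
# 132 (0x84)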