I’m working on a research project for which we are using the OpenAI API with GPT3. The query (in Python) looks as follows:
response = openai.Completion.create(
    model="text-davinci-003",
    prompt=prompt,
    max_tokens=0,  # generate nothing new
    n=1,
    logprobs=1,
    echo=True,  # return the prompt itself together with its logprobs
)
For English sentences, everything works as expected. For sentences in German or Japanese, however, response.choices[0].text looks fine, while response.choices[0].logprobs.tokens is a mess, with many non-English characters replaced by cryptic bytes:\xXX sequences. For example:
"Übliche Öfen mögen Äpfel übermäßig." (notice the umlauts)
['bytes:\\xc3', 'bytes:\\x9c', 'b', 'lic', 'he', ' Ö', 'fen', ' m', 'ö', 'gen', 'bytes: \\xc3', 'bytes:\\x84', 'p', 'f', 'el', 'bytes: \\xc3', 'bytes:\\xbc', 'ber', 'm', 'ä', 'ß', 'ig', '.']
While more chopped up than an English sentence, most tokens are fine, but some of the umlauts are instead replaced by pairs of tokens of the form bytes:\xXX.
For a Japanese sentence, it’s even worse, with most tokens being the mangled bytes:\xXX kind.
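And these aren’t even the Unicode code points for the characters. EDIT: They are not code points, but they are simply the UTF-8 bytes of the characters (Ü, for example, encodes to the two bytes 0xC3 0x9C); see my response below.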
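A quick way to check this in plain Python, using only the byte values from the output above (the variable names are just for illustration):

```python
# "Ü" is the single code point U+00DC, but UTF-8 stores it as two bytes.
code_point = ord("Ü")
utf8_bytes = "Ü".encode("utf-8")

print(hex(code_point))   # the code point, which is not what the tokens show
print(utf8_bytes)        # the two bytes that show up as 'bytes:\xc3', 'bytes:\x9c'

# Going the other way: the two byte values from the token list give back the character.
recovered = bytes([0xC3, 0x9C]).decode("utf-8")
print(recovered)
```

So each bytes:\xXX token carries only part of a character, and neither half is printable on its own.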
Does anybody have an idea what exactly is happening? And is there some way to recover the readable tokens?
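One possible way to recover readable tokens, as a minimal sketch. It assumes the format seen in the output above: a token either is ordinary text, or starts with "bytes:" followed by \xXX escapes (possibly mixed with literal characters such as a leading space). That format is inferred from the sample, not taken from any documentation, and the helper names token_to_bytes and readable_tokens are made up for this example.

```python
import re

BYTES_PREFIX = "bytes:"
HEX_ESCAPE = re.compile(r"\\x([0-9a-fA-F]{2})")


def token_to_bytes(token):
    """Turn one token string into the raw bytes it stands for."""
    if not token.startswith(BYTES_PREFIX):
        return token.encode("utf-8")
    payload = token[len(BYTES_PREFIX):]
    out = bytearray()
    pos = 0
    for match in HEX_ESCAPE.finditer(payload):
        # Anything between escapes (e.g. a leading space) is kept literally.
        out += payload[pos:match.start()].encode("utf-8")
        out.append(int(match.group(1), 16))
        pos = match.end()
    out += payload[pos:].encode("utf-8")
    return bytes(out)


def readable_tokens(tokens):
    """Merge adjacent partial-byte tokens until they decode as UTF-8."""
    merged = []
    buffer = b""
    for token in tokens:
        buffer += token_to_bytes(token)
        try:
            merged.append(buffer.decode("utf-8"))
            buffer = b""
        except UnicodeDecodeError:
            # Incomplete character so far; wait for the next token.
            continue
    if buffer:
        # Leftover bytes that never formed valid UTF-8.
        merged.append(buffer.decode("utf-8", errors="replace"))
    return merged


tokens = ['bytes:\\xc3', 'bytes:\\x9c', 'b', 'lic', 'he', ' Ö', 'fen', ' m',
          'ö', 'gen', 'bytes: \\xc3', 'bytes:\\x84', 'p', 'f', 'el',
          'bytes: \\xc3', 'bytes:\\xbc', 'ber', 'm', 'ä', 'ß', 'ig', '.']

print(readable_tokens(tokens))
print("".join(readable_tokens(tokens)))
```

Since merging changes the number of tokens, the entries in token_logprobs would have to be combined in the same way (for log probabilities, by adding the values of the merged tokens). Note that this sketch keeps buffering if it ever meets bytes that are invalid rather than merely incomplete, so it is only suitable for well-formed output.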