I’m working on a research project for which we are using the OpenAI API with GPT-3. The query (in Python) looks as follows:
import openai

response = openai.Completion.create(
    model="text-davinci-003",
    prompt=prompt,
    max_tokens=0,  # generate nothing; we only want the prompt scored
    n=1,
    logprobs=1,
    echo=True,     # return the prompt tokens and their logprobs
)
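For reference, this is how I read the fields mentioned below off the response:

choice = response.choices[0]
print(choice.text)             # the echoed prompt, perfectly readable
print(choice.logprobs.tokens)  # the per-token strings shown below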
For English sentences, everything works as expected, but for sentences in German or Japanese, while response.choices[0].text looks fine, response.choices[0].logprobs.tokens is a mess, with many non-English characters replaced by cryptic bytes:\xXX sequences. For example:
prompt : "Übliche Öfen mögen Äpfel übermäßig."
(notice the umlauts)
tokens: ['bytes:\\xc3', 'bytes:\\x9c', 'b', 'lic', 'he', ' Ö', 'fen', ' m', 'ö', 'gen', 'bytes: \\xc3', 'bytes:\\x84', 'p', 'f', 'el', 'bytes: \\xc3', 'bytes:\\xbc', 'ber', 'm', 'ä', 'ß', 'ig', '.']
While more chopped up than an English sentence would be, most tokens are fine, but some of the umlauts are instead replaced by pairs of tokens of the form bytes:\xXX.
For a Japanese sentence, it’s even worse, with most tokens showing up as mangled bytes:\xXX sequences.
And these aren’t even the Unicode code points of the characters. EDIT: Yes they are, they are simply UTF-8; see my response below.
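As a quick check, concatenating the first two byte tokens from the German example and decoding them as UTF-8 gives back the umlaut:

pair = bytes([0xC3, 0x9C])   # 'bytes:\xc3' + 'bytes:\x9c' from the token list above
print(pair.decode("utf-8"))  # prints 'Ü'

So each such pair is simply the UTF-8 encoding of one character that the byte-level tokenizer has split below the character level.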
Does anybody have an idea what exactly is happening? And is there some way to recover the readable tokens?