Does the GPT-4 API interpret encoded (UTF-8) emoji as literal emoji?

shopandserveonline · September 18, 2023, 2:44am

I’m currently performing text classification using GPT-4 via its API (8k context). My data was accidentally encoded into UTF-8 during preprocessing, like this:

Thank you so much!Ã°Å¸â„¢ÂÃ¢ÂÂ¤Ã¯Â¸Â

which should be

Thank you so much!🙏❤️

I’m new to this field and am currently working on my undergrad thesis. I’m anxious that my panel might ask whether GPT-4 understands that those are emojis or just sees gibberish symbols. Huhu, thank you.

curt.kennedy · September 18, 2023, 3:26am

Yes, the model understands emojis and UTF-8 for sure. Example:

_j · September 18, 2023, 4:43am

AI can’t fix your text:

The original text appears to contain emoji and special characters. Here’s the repaired version of the text:

Internally, the AI’s tokens for Unicode text are byte sequences, encoded like this:

Token Number	Token Contents	Rendered Unicode
35087	b'\xe6\xb8'	淸
85301	b'\xe6\xb9'	湹
36118	b'\xe6\xba'	溺
73598	b'\xe6\xbb'	滿
78257	b'\xe6\xbc'	漷
164	b'\xe7'	率
66286	b'\xe7\x81'	爁
32017	b'\xe7\x84'	焄
85664	b'\xe7\x84\xa5\xe3\x81\x97\xe3\x81'	焥し
76208	b'\xe7\x88'	爈
17045	b'\xe7\x89'	牅
29208	b'\xe7\x8e'	玎
80243	b'\xe7\x8f'	珏
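To make the byte-fragment idea concrete, here is a minimal sketch using only Python's built-in UTF-8 codec. It does not call the model's tokenizer; it only illustrates that a multi-byte character can be split into pieces that are not decodable on their own, which is what token contents like the ones above look like. The helper name `try_decode` is made up for this illustration.

```python
# Illustration only: plain UTF-8 encoding, not the model's actual tokenizer.
emoji = "\U0001F64F"          # folded hands emoji (U+1F64F)
raw = emoji.encode("utf-8")   # four bytes for one visible character


def try_decode(fragment: bytes):
    """Return the decoded text, or None if the bytes are not complete UTF-8."""
    try:
        return fragment.decode("utf-8")
    except UnicodeDecodeError:
        return None


print(raw)                   # the full 4-byte sequence
print(try_decode(raw[:2]))   # a 2-byte fragment cannot be decoded alone
print(try_decode(raw))       # the complete sequence decodes to the emoji
```

A token holding only part of a character is therefore meaningful only together with the tokens that follow it.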

There are very few byte-sized symbols:
[15555, 1, ‘█’]
[22178, 2, ‘██’]
[52779, 4, ‘████’]
[85158, 1, ‘░’]
[47775, 1, ‘■’]
[83565, 1, ‘►’]
[45049, 1, ‘●’]
[27348, 1, ‘★’]
[49618, 2, ‘★★’]
[47239, 1, ‘☆’]
[67581, 1, ‘☴’]
[32991, 1, ‘’]
[40786, 2, ‘’]
[88040, 4, ‘’]
[77810, 1, ‘’]
[40621, 1, ‘♪’]
[66326, 3, ‘♪\n\n’]
[75352, 1, ‘’]
[40710, 1, ‘⟩’]

Here’s a table I had AI make, with only slight mangling, to give you an idea of how you should send and receive emoji.

Python Escaped Sequence	HTTP JSON Escape	HTML Code
`\U0001F600`	`\uD83D\uDE00`	`😀`
`\u2B50`	`\u2B50`	`⭐`
`\U0001F680`	`\uD83D\uDE80`	`🚀`
`\U0001F422`	`\uD83D\uDC22`	`🐢`
`\U0001F33A`	`\uD83C\uDF3A`	`🌺`

extended Unicode emoji:

Python Escaped Sequence	HTTP JSON Escape	HTML Code
`\u262F\uFE0F`	`\u262F\uFE0F`	`☯️`
`\U0001F1EF\U0001F1F5`	`\uD83C\uDDEF\uD83C\uDDF5`	`🇯🇵`
`\U0001F549\uFE0F`	`\uD83D\uDD49\uFE0F`	`🕉️`
`\U0001F30D\uFE0F`	`\uD83C\uDF0D\uFE0F`	`🌍️`
`\U0001F3AD\uFE0F`	`\uD83C\uDFAD\uFE0F`	`🎭️`

(the “http” I requested is actually stringified-for-json)
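For anyone who wants to generate these forms rather than trust a hand-made table, here is a small sketch using only the Python standard library. The function name `escape_forms` is invented for this example. Python prints lowercase hex digits, which is equivalent to the uppercase form, and code points above U+FFFF need `\U` plus eight digits in Python and a surrogate pair in JSON.

```python
import json


def escape_forms(text: str) -> dict:
    """Return the Python, JSON, and HTML numeric escapes for a string."""
    return {
        # Python source escape, e.g. \U0001f600 for code points above U+FFFF
        "python": text.encode("unicode_escape").decode("ascii"),
        # JSON string escape; json.dumps emits surrogate pairs by default
        "json": json.dumps(text)[1:-1],
        # HTML numeric character reference, one per code point
        "html": "".join(f"&#x{ord(ch):X};" for ch in text),
    }


for sample in ["\U0001F600", "\u2B50", "\u262F\uFE0F"]:
    print(escape_forms(sample))
```

Decoding goes the other way with `json.loads`, which joins a surrogate pair back into a single character.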

So you’ll likely need some processor to recognize the possible unicode bytes and convert them by chance and luck.
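One possible shape for such a processor, as a rough sketch rather than a guaranteed fix: keep reversing the "UTF-8 bytes read as a single-byte codec" mistake until the round trip stops working. The function name `undo_mojibake`, the `latin1` default, and the round limit are assumptions made for this illustration. Text damaged through Windows-1252, or text that has lost invisible control characters along the way, may not be recoverable like this.

```python
def undo_mojibake(text: str, codec: str = "latin1", max_rounds: int = 3) -> str:
    """Try to reverse repeated UTF-8-read-as-single-byte damage.

    Stops as soon as a round trip fails or no longer changes the text.
    """
    for _ in range(max_rounds):
        try:
            # Recover the raw bytes, then decode them as the UTF-8 they were
            candidate = text.encode(codec).decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            break  # text is not (or no longer) this kind of mojibake
        if candidate == text:
            break  # pure ASCII or already clean
        text = candidate
    return text


# Simulate the damage twice on a known-good string, then repair it
original = "Thank you so much!\U0001F64F\u2764\uFE0F"
mangled = original.encode("utf-8").decode("latin1")
mangled_twice = mangled.encode("utf-8").decode("latin1")
print(undo_mojibake(mangled_twice) == original)
```

Because the loop gives up on the first failed round trip, ordinary accented text such as "café" is left untouched.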

shopandserveonline · September 18, 2023, 5:18am

Hi, thank you for your response.

I’m not sure how to fix the encoded Unicode. It happened when I saved the processed data using to_csv, which defaults to UTF-8. If need be, I think I’ll just copy the emojis from the original texts?
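A quick way to check where the damage comes from is sketched below with the standard `csv` module standing in for pandas, which is an assumption made so the example runs without extra packages. Writing as UTF-8 is harmless by itself. The garbling shows up when the same file is read back with a different codec, simulated here with `latin1`. The file name and the helper `read_text_cell` are made up for the example.

```python
import csv
import os
import tempfile

text = "Thank you so much!\U0001F64F\u2764\uFE0F"


def read_text_cell(path: str, encoding: str) -> str:
    """Read the first data cell of the CSV using the given encoding."""
    with open(path, encoding=encoding, newline="") as f:
        rows = list(csv.reader(f))
    return rows[1][0]


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.csv")
    # Save as UTF-8, the same default the post mentions for to_csv
    with open(path, "w", encoding="utf-8", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["text"])
        writer.writerow([text])

    good = read_text_cell(path, "utf-8")   # matching codec: emoji intact
    bad = read_text_cell(path, "latin1")   # wrong codec: mojibake

print(good == text)
print(bad == text)
```

If the wrong-codec read is the culprit, re-reading the original file with `encoding="utf-8"` avoids the need for any repair step.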

shopandserveonline · September 18, 2023, 5:20am

I see, but would that change if we access the GPT-4 model via the API?

_j · September 18, 2023, 5:23am

The answer could be “ask an AI”, in this case gpt-4 prompted as my programming pal and decoding desperado:

Sure, you can use the codecs module in Python to decode the UTF-8 encoded string back to its original form. Here’s a simple function that does this:

import codecs

def decode_utf8(s):
    return codecs.decode(s, 'unicode_escape')

# Test the function
s = "Thank you so much!Ã°Å¸â„¢ÂÃ¢ÂÂ¤Ã¯Â¸Â"
print(decode_utf8(s))

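However, 'unicode_escape' only interprets backslash escapes, and the string you provided seems to be doubly encoded: UTF-8 bytes that were then decoded as a single-byte encoding such as ISO-8859-1 (or latin1). To undo one such round, you need to decode it in two steps:

def decode_doubly_encoded_utf8(s):
    bytes_s = s.encode('latin1')  # Recover the raw bytes using latin1 encoding
    return bytes_s.decode('utf8')  # Decode the bytes using utf8

# Test the function on text that went through one UTF-8 -> latin1 mix-up
s = "Thank you so much!🙏❤️".encode('utf8').decode('latin1')
print(decode_doubly_encoded_utf8(s))

This should print:

Thank you so much!🙏❤️

Please note that the sample as pasted earlier will not survive this exact call: it contains the character „, which latin1 cannot encode, and it appears to have been mangled more than once, so it may need the step repeated with a different codec or a dedicated repair tool. The actual output may also vary depending on the environment where the code is run. Some environments may not support displaying certain Unicode characters.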
