Does the GPT-4 API interpret encoded (UTF-8) emoji as literal emoji?

I’m currently performing text classification using GPT-4 via its API (8k context). My data was accidentally encoded into UTF-8 during preprocessing, like this:

Thank you so much!🙏❤️

which should be

Thank you so much!:pray::heart:

I’m new to this field and currently working on my undergrad thesis. I’m anxious that my panel might ask me whether GPT-4 understands that those are emojis or just sees gibberish symbols. Thank you!

Yes, the model understands emoji and UTF-8 for sure. Example:


AI can’t fix your text:

The original text appears to contain emoji and special characters. Here’s the repaired version of the text:
:running_man::heart::small_orange_diamond:

Internally, the AI's tokens for Unicode text are raw byte sequences, such as:

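| Token Number | Token Contents | Rendered Unicode |
|---|---|---|
| 35087 | `b'\xe6\xb8'` | |
| 85301 | `b'\xe6\xb9'` | |
| 36118 | `b'\xe6\xba'` | |
| 73598 | `b'\xe6\xbb'` | 滿 |
| 78257 | `b'\xe6\xbc'` | |
| 164 | `b'\xe7'` | |
| 66286 | `b'\xe7\x81'` | |
| 32017 | `b'\xe7\x84'` | |
| 85664 | `b'\xe7\x84\xa5\xe3\x81\x97\xe3\x81'` | 焥し |
| 76208 | `b'\xe7\x88'` | |
| 17045 | `b'\xe7\x89'` | |
| 29208 | `b'\xe7\x8e'` | |
| 80243 | `b'\xe7\x8f'` | |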
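To see why most of those tokens have nothing in the rendered column, here is a small sketch using only Python's built-in UTF-8 codec (no tokenizer involved; `try_decode` is just a helper name made up for illustration). A multi-byte character only renders once all of its bytes are present:

```python
# UTF-8 spends several bytes on one emoji; a token may hold only some of them.
text = "🙏"
raw = text.encode("utf-8")
print(raw, len(raw))


def try_decode(chunk: bytes) -> str:
    """Return the decoded text, or an empty string if the bytes are incomplete."""
    try:
        return chunk.decode("utf-8")
    except UnicodeDecodeError:
        return ""


# Two bytes of a three-byte character (like token 73598 above) decode to nothing.
partial = try_decode(b"\xe6\xbb")
# Adding the final byte completes the character.
complete = try_decode(b"\xe6\xbb\xbf")
print(repr(partial), repr(complete))
```

So a single emoji can be spread over more than one token, and the model still sees the full byte sequence.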

There are very few whole symbols that get a token of their own:
[15555, 1, ‘█’]
[22178, 2, ‘██’]
[52779, 4, ‘████’]
[85158, 1, ‘░’]
[47775, 1, ‘■’]
[83565, 1, ‘►’]
[45049, 1, ‘●’]
[27348, 1, ‘★’]
[49618, 2, ‘★★’]
[47239, 1, ‘☆’]
[67581, 1, ‘☴’]
[32991, 1, ‘:female_sign:’]
[40786, 2, ‘:female_sign::female_sign:’]
[88040, 4, ‘:female_sign::female_sign::female_sign::female_sign:’]
[77810, 1, ‘:heart:’]
[40621, 1, ‘♪’]
[66326, 3, ‘♪\n\n’]
[75352, 1, ‘:heavy_check_mark:’]
[40710, 1, ‘⟩’]

Here’s a table I had AI make, with only slight mangling to give you an idea how you should send and receive.

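| Rendered | Python Escaped Sequence | HTTP JSON Escape | HTML Code |
|---|---|---|---|
| 😀 | `\U0001F600` | `\uD83D\uDE00` | `&#x1F600;` |
| ⭐ | `\u2B50` | `\u2B50` | `&#x2B50;` |
| 🚀 | `\U0001F680` | `\uD83D\uDE80` | `&#x1F680;` |
| 🐢 | `\U0001F422` | `\uD83D\uDC22` | `&#x1F422;` |
| 🌺 | `\U0001F33A` | `\uD83C\uDF3A` | `&#x1F33A;` |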

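Extended Unicode emoji:

| Rendered | Python Escaped Sequence | HTTP JSON Escape | HTML Code |
|---|---|---|---|
| ☯️ | `\u262F\uFE0F` | `\u262F\uFE0F` | `&#x262F;&#xFE0F;` |
| 🇯🇵 | `\U0001F1EF\U0001F1F5` | `\uD83C\uDDEF\uD83C\uDDF5` | `&#x1F1EF;&#x1F1F5;` |
| 🕉️ | `\U0001F549\uFE0F` | `\uD83D\uDD49\uFE0F` | `&#x1F549;&#xFE0F;` |
| 🌍️ | `\U0001F30D\uFE0F` | `\uD83C\uDF0D\uFE0F` | `&#x1F30D;&#xFE0F;` |
| 🎭️ | `\U0001F3AD\uFE0F` | `\uD83C\uDFAD\uFE0F` | `&#x1F3AD;&#xFE0F;` |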

(the “http” I requested is actually stringified-for-json)
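If you want to check escapes like these yourself rather than trust an AI-made table, a rough sketch with the standard library could look like this. `escape_forms` is a hypothetical helper name, and the "json" entry is simply what `json.dumps` produces with its default `ensure_ascii=True`:

```python
import html
import json


def escape_forms(text: str) -> dict:
    """Return several ASCII-only spellings of the same Unicode text."""
    return {
        # Python source escape, e.g. \U0001f600
        "python": text.encode("unicode_escape").decode("ascii"),
        # JSON string literal; non-BMP characters become surrogate pairs
        "json": json.dumps(text),
        # HTML numeric character references, one per code point
        "html": "".join(f"&#x{ord(ch):X};" for ch in text),
    }


forms = escape_forms("😀")
print(forms)

# Each form decodes back to the same single character.
roundtrip_json = json.loads(forms["json"])
roundtrip_html = html.unescape(forms["html"])
print(roundtrip_json == roundtrip_html == "😀")
```

Whichever spelling travels over the wire, the JSON parser on the other end turns it back into the same characters before the model sees it.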

So you’ll likely need some processor to recognize the possible Unicode bytes and convert them, with some chance and luck.


Hi, thank you for your response.

I’m not sure how to fix the encoded Unicode. It happened when I saved the processed data using to_csv, which defaults to UTF-8. If it comes to that, I think I’ll just copy the emojis from the original texts?
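For what it's worth, here is a minimal sketch of how this kind of damage usually arises. It assumes (this is a guess about the pipeline, not something confirmed in the thread) that the file was written as UTF-8 and later read back with a single-byte codec such as latin1:

```python
original = "Thank you so much!🙏❤️"

# Writing the text as UTF-8 is lossless on its own.
file_bytes = original.encode("utf-8")

# The garbling appears only when those bytes are read with a single-byte codec.
misread = file_bytes.decode("latin1")
# Reading with the matching codec returns the text unchanged.
reread = file_bytes.decode("utf-8")

print(misread == original)
print(reread == original)
```

If that is what happened, re-reading the saved file with the encoding set to UTF-8 may be enough, with no repair step needed.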

I see, but would that change if we access the GPT-4 model via the API?

The answer could be “ask an AI”, in this case gpt-4 prompted as my programming pal and decoding desperado:


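Sure. If the text contains literal escape sequences (for example \U0001F64F), you can use the codecs module in Python to decode them back to the original characters. Here’s a simple function that does this:

import codecs

def decode_utf8(s):
    # Interprets literal backslash escapes; it does not repair mis-decoded bytes
    return codecs.decode(s, 'unicode_escape')

# Test the function
s = r"Thank you so much!\U0001F64F\u2764\uFE0F"
print(decode_utf8(s))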

However, the string you provided seems to be mis-decoded: its UTF-8 bytes were read as ISO-8859-1 (or latin1). In this case, you need to reverse that in two steps:

def decode_doubly_encoded_utf8(s):
    bytes_s = s.encode('latin1')  # Recover the original bytes using latin1 encoding
    return bytes_s.decode('utf8')  # Decode the bytes using utf8

# Test the function
# Simulate the damage: UTF-8 bytes misread as latin1
s = "Thank you so much!🙏❤️".encode('utf8').decode('latin1')
print(decode_doubly_encoded_utf8(s))

This should print:

Thank you so much!🙏❤️

Please note that the actual output may vary depending on the environment where the code is run. Some environments may not support displaying certain Unicode characters.
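For a whole dataset where only some rows are damaged, a slightly more defensive version might help. This is a sketch under the same assumption as above (UTF-8 bytes misread as latin1); `repair_mojibake` is a made-up name, and rows that do not fit the pattern are returned untouched:

```python
def repair_mojibake(text: str) -> str:
    """Undo UTF-8-read-as-latin1 damage; return the input unchanged if it does not apply."""
    try:
        return text.encode("latin1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside latin1 (already fine),
        # or the recovered bytes are not valid UTF-8 (not this kind of damage).
        return text


# Simulate one damaged row next to rows that should be left alone.
damaged = "Thank you so much!🙏❤️".encode("utf-8").decode("latin1")
rows = [damaged, "plain ASCII text", "already fine 🙏", "café"]
fixed = [repair_mojibake(row) for row in rows]
print(fixed)
```

It is worth spot-checking the output against the original texts, since a heuristic like this cannot know for certain what each row looked like before.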
