Does the GPT-4 API interpret encoded (UTF-8) emoji as literal emoji?

I’m currently performing text classification using GPT-4 via its API (8k context). My data was accidentally encoded into UTF-8 during preprocessing, like this:

Thank you so much!🙏❤️

which should be

Thank you so much!:pray::heart:

I’m new to this field and am currently working on my undergrad thesis. I’m anxious that my panel might ask me whether GPT-4 understands that those are emojis or just sees gibberish symbols. Thank you!

Yes, the model understands emoji and UTF-8 for sure. Example:


AI can’t fix your text:

The original text appears to contain emoji and special characters. Here’s the repaired version of the text:

Internally, the AI’s tokens for Unicode text are byte sequences, encoded like this:

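| Token Number | Token Contents | Rendered Unicode |
|---|---|---|
| 35087 | `b'\xe6\xb8'` | |
| 85301 | `b'\xe6\xb9'` | |
| 36118 | `b'\xe6\xba'` | |
| 73598 | `b'\xe6\xbb'` | 滿 |
| 78257 | `b'\xe6\xbc'` | |
| 164 | `b'\xe7'` | |
| 66286 | `b'\xe7\x81'` | |
| 32017 | `b'\xe7\x84'` | |
| 85664 | `b'\xe7\x84\xa5\xe3\x81\x97\xe3\x81'` | 焥し |
| 76208 | `b'\xe7\x88'` | |
| 17045 | `b'\xe7\x89'` | |
| 29208 | `b'\xe7\x8e'` | |
| 80243 | `b'\xe7\x8f'` | |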
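To make "bytes" concrete, here is a minimal sketch that prints the raw UTF-8 bytes behind the emoji from the question. It uses only the Python standard library and does not run the model's actual tokenizer; the helper name `utf8_bytes_per_char` is made up for illustration. The point is only that one visible emoji is several bytes, and tokens like the ones in the table are built from such byte sequences.

```python
def utf8_bytes_per_char(text):
    """Return (code point label, UTF-8 bytes) for each character in text."""
    return [(f"U+{ord(ch):04X}", ch.encode("utf-8")) for ch in text]

# Folded hands, heavy black heart, and the variation selector
# that asks for the heart to be drawn as an emoji.
emoji = "\U0001F64F\u2764\uFE0F"

for label, raw in utf8_bytes_per_char(emoji):
    print(label, raw, len(raw), "bytes")
```

The folded hands take four bytes, and the heart and its variation selector take three each.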

There are very few symbols that get a token of their own (token number, length in characters, contents):
[15555, 1, '█']
[22178, 2, '██']
[52779, 4, '████']
[85158, 1, '░']
[47775, 1, '■']
[83565, 1, '►']
[45049, 1, '●']
[27348, 1, '★']
[49618, 2, '★★']
[47239, 1, '☆']
[67581, 1, '☴']
[32991, 1, ':female_sign:']
[40786, 2, ':female_sign::female_sign:']
[88040, 4, ':female_sign::female_sign::female_sign::female_sign:']
[77810, 1, ':heart:']
[40621, 1, '♪']
[66326, 3, '♪\n\n']
[75352, 1, ':heavy_check_mark:']
[40710, 1, '⟩']

Here’s a table I had AI make, with only slight mangling to give you an idea how you should send and receive.

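| Rendered | Python Escaped Sequence | HTTP JSON Escape | HTML Code |
|---|---|---|---|
| 😀 | `\U0001F600` | `\uD83D\uDE00` | `&#x1F600;` |
| ⭐ | `\u2B50` | `\u2B50` | `&#x2B50;` |
| 🚀 | `\U0001F680` | `\uD83D\uDE80` | `&#x1F680;` |
| 🐢 | `\U0001F422` | `\uD83D\uDC22` | `&#x1F422;` |
| 🌺 | `\U0001F33A` | `\uD83C\uDF3A` | `&#x1F33A;` |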

extended Unicode emoji:

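| Rendered | Python Escaped Sequence | HTTP JSON Escape | HTML Code |
|---|---|---|---|
| ☯️ | `\u262F\uFE0F` | `\u262F\uFE0F` | `&#x262F;&#xFE0F;` |
| 🇯🇵 | `\U0001F1EF\U0001F1F5` | `\uD83C\uDDEF\uD83C\uDDF5` | `&#x1F1EF;&#x1F1F5;` |
| 🕉️ | `\U0001F549\uFE0F` | `\uD83D\uDD49\uFE0F` | `&#x1F549;&#xFE0F;` |
| 🌍️ | `\U0001F30D\uFE0F` | `\uD83C\uDF0D\uFE0F` | `&#x1F30D;&#xFE0F;` |
| 🎭️ | `\U0001F3AD\uFE0F` | `\uD83C\uDFAD\uFE0F` | `&#x1F3AD;&#xFE0F;` |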

(the “http” I requested is actually stringified-for-json)
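Rather than typing escapes by hand, they can be generated. The sketch below is one way to do it with the Python standard library; the function name `escape_forms` is hypothetical. Two details are easy to get wrong: in Python source, code points above U+FFFF need `\U` with eight hex digits, and in JSON they are written as a surrogate pair of two `\u` escapes. Either way, the receiving side decodes them back to the same emoji.

```python
import json

def escape_forms(text):
    """Return (Python escape, JSON escape, HTML numeric references) for text."""
    python_escape = text.encode("unicode_escape").decode("ascii")
    # json.dumps escapes non-ASCII by default (ensure_ascii=True);
    # strip the surrounding double quotes it adds.
    json_escape = json.dumps(text)[1:-1]
    html_code = "".join(f"&#x{ord(ch):X};" for ch in text)
    return python_escape, json_escape, html_code

for emoji in ["\U0001F600", "\u2B50", "\u262F\uFE0F", "\U0001F1EF\U0001F1F5"]:
    print(emoji, *escape_forms(emoji))

# The escaped JSON form and the literal emoji decode to the same string.
round_trip = json.loads(json.dumps("\U0001F600"))
print(round_trip == "\U0001F600")
```

Python prints the hex digits in lowercase; escapes are case-insensitive in both Python and JSON, so that makes no difference.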

So you’ll likely need some processing step that recognizes the garbled Unicode bytes and converts them back, with some chance and luck.


Hi, thank you for your response.

I’m not sure how to fix the encoded Unicode. It happened when I saved the processed data using to_csv, which defaults to UTF-8. If need be, I think I’ll just copy the emojis from the original texts?

I see, but would that change if we access the GPT-4 model via the API?
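On the to_csv point: writing a file as UTF-8 does not by itself damage emoji. The garbling usually appears when the UTF-8 file is later read or opened with a single-byte encoding. The sketch below illustrates that under stated assumptions: it uses the standard-library `csv` module instead of pandas to stay self-contained, uses latin-1 as the example of a wrong encoding, and the helper name `read_text_column` is made up.

```python
import csv
import os
import tempfile

text = "Thank you so much!\U0001F64F\u2764\uFE0F"

# Write a tiny CSV as UTF-8, the same default encoding that pandas' to_csv uses.
fd, path = tempfile.mkstemp(suffix=".csv")
os.close(fd)
with open(path, "w", encoding="utf-8", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["text"])
    writer.writerow([text])

def read_text_column(path, encoding):
    """Read the first data row's only column using the given encoding."""
    with open(path, encoding=encoding, newline="") as f:
        rows = list(csv.reader(f))
    return rows[1][0]

intact = read_text_column(path, "utf-8")     # same encoding as written
garbled = read_text_column(path, "latin-1")  # wrong encoding on purpose
os.remove(path)

print(intact == text)   # the emoji survive the round trip
print(garbled == text)  # the emoji turn into several odd characters each
```

If the file on disk is still intact, re-reading it with `encoding="utf-8"` may be all that is needed.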

The answer could be “ask an AI”, in this case gpt-4 prompted as my programming pal and decoding desperado:

Sure, you can use the codecs module in Python to decode the UTF-8 encoded string back to its original form. Here’s a simple function that does this:

import codecs

def decode_utf8(s):
    return codecs.decode(s, 'unicode_escape')

# Test the function (it only interprets backslash escapes, so it does not repair garbled text)
s = "Thank you so much!🙏❤️"
print(decode_utf8(s))

However, the string you provided seems to be doubly processed: first encoded as UTF-8, and then those bytes decoded as ISO-8859-1 (or latin1). In this case, you need to reverse it in two steps:

def decode_doubly_encoded_utf8(s):
    bytes_s = s.encode('latin1')  # Convert to bytes using latin1 encoding
    return bytes_s.decode('utf8')  # Decode the bytes using utf8

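# Test the function on a garbled copy: UTF-8 bytes misread as latin1
s = "Thank you so much!🙏❤️".encode('utf8').decode('latin1')
print(decode_doubly_encoded_utf8(s))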

This should print:

Thank you so much!🙏❤️

Please note that the actual output may vary depending on the environment where the code is run. Some environments may not support displaying certain Unicode characters.
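One caveat with the two-step function: `s.encode('latin1')` raises `UnicodeEncodeError` if the garbled text contains characters such as `Ÿ` or `™`, which is what you get when the bytes were misread as Windows-1252 rather than latin1. The sketch below is a more tolerant variant, assuming the damage came from one of those two encodings. All three function names are hypothetical, and `garble` exists only to simulate the damage for the demo.

```python
def to_original_bytes(text):
    """Map each character back to the single byte it was misread from."""
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode("cp1252")
        except UnicodeEncodeError:
            # Bytes undefined in cp1252 (e.g. 0x8F, 0x9D) show up as latin-1
            # control characters; raises if ch is not a single byte at all.
            out += ch.encode("latin-1")
    return bytes(out)

def repair_mojibake(text):
    """Return the repaired text, or the input unchanged if it is not repairable."""
    try:
        return to_original_bytes(text).decode("utf-8")
    except UnicodeError:
        return text

def garble(text):
    """Simulate the damage: UTF-8 bytes misread byte by byte as cp1252."""
    chars = []
    for b in text.encode("utf-8"):
        try:
            chars.append(bytes([b]).decode("cp1252"))
        except UnicodeDecodeError:
            chars.append(bytes([b]).decode("latin-1"))
    return "".join(chars)

original = "Thank you so much!\U0001F64F\u2764\uFE0F"
print(repair_mojibake(garble(original)) == original)
print(repair_mojibake("already fine \U0001F64F"))
```

Text that is already correct is returned unchanged, so it should be safe to run over a whole column. For messier data, the third-party ftfy library handles many more cases.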