Hi all, I’m unable to decode any emojis (in the form \ud83d\ude0e) in my ChatGPT API responses, no matter what encoding/decoding methods I use. I appreciate this may be more of a general Python question, but I have no issues encoding/decoding ordinary Python strings; I’m only having serious trouble with ChatGPT API responses.
Note: some emojis print fine; I only have issues with emojis in the form \ud83d\ude0e.
Below is my API function (a few details removed). Can anybody suggest how I can ensure emojis (in forms such as \ud83d\ude0e) are printed as actual emojis? FYI, I’ve removed all my previous failed attempts at converting to actual emojis, so the current script just prints emojis in forms such as \ud83d\ude0e. Much appreciated.
def Generate_GPT_Message(Chatlogs):
    GPT_Response = None
    try:
        clientGPT = OpenAI(api_key = "REMOVED")
        Chatgpt_prompt = [
            {"role": "system", "content": f"You are a person, respond to the message and include an emoji at the end"}
        ]
        Chatgpt_prompt.extend(Chatlogs)
        chat_completion = clientGPT.chat.completions.create(
            model="gpt-4o-mini",
            messages=Chatgpt_prompt,
        )
        GPT_Response = chat_completion.choices[0].message.content
        if GPT_Response.endswith('.'):
            GPT_Response = GPT_Response[:-1]
    except Exception as e:
        logger.error(f"Unexpected error occurred: {str(e)}")
    logger.info(GPT_Response)
    return GPT_Response
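For the more general case of encoding problems with escaped strings from various sources (although the AI should be trained on, and exclusively return, UTF-8), here is a function and explanation that an AI whipped up to my spec.
Python 3.8+, which you should be using for anything OpenAI, handles Unicode natively: a str is a sequence of Unicode code points, and UTF-8 is the default encoding for source files and for str.encode().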
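To make the encodings concrete, here is a small sketch (illustrative only; the variable names are mine) of how a single emoji looks as a Python str and as UTF-8, UTF-16, and UTF-32 bytes. Note that the UTF-16 form is exactly the d83d/de0e pair from the question.

```python
# Illustrative only: one emoji as a str and in three encodings.
emoji = "\U0001F60E"  # smiling face with sunglasses, a single code point

utf8_bytes = emoji.encode("utf-8")
utf16_bytes = emoji.encode("utf-16-be")  # big-endian, no byte order mark
utf32_bytes = emoji.encode("utf-32-be")

print(len(emoji))         # 1 (one code point)
print(utf8_bytes.hex())   # f09f988e
print(utf16_bytes.hex())  # d83dde0e (the surrogate pair)
print(utf32_bytes.hex())  # 0001f60e
```

So \ud83d\ude0e in a response is the UTF-16 spelling of the code point U+1F60E, written out as escape text.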
Detect and convert UTF-8, UTF-16, or UTF-32 encoded strings to printable Unicode, especially emoji.
To handle input strings that may contain UTF-8, UTF-16, or UTF-32 encoded characters and ensure proper printable emoji (including glyphs with gender modifiers and other combined sequences), you can create a function that:
Detects the encoding type.
Converts the string to Unicode (if necessary).
Handles surrogate pairs or combined sequences.
Ensures that the resulting string is printable in a Python environment.
Here’s a Python function that accomplishes this:
def detect_and_convert_encoding(s):
    """
    Detects and converts escaped strings in UTF-8, UTF-16, or UTF-32
    to printable Unicode strings.
    """
    try:
        # Expand escape sequences such as \xf0 or \ud83d into raw characters.
        # 'backslashreplace' turns non-Latin-1 characters and lone surrogates
        # into escapes first, so this step never raises UnicodeEncodeError.
        raw = s.encode('latin1', 'backslashreplace').decode('unicode-escape')
    except UnicodeDecodeError:
        # Malformed escape sequence: return the input unchanged
        return s

    try:
        # Attempt to decode as UTF-8
        return raw.encode('latin1').decode('utf-8')
    except (UnicodeDecodeError, UnicodeEncodeError):
        pass

    try:
        # Attempt to decode as UTF-16 with surrogate pairs handled
        return raw.encode('utf-16', 'surrogatepass').decode('utf-16')
    except (UnicodeDecodeError, UnicodeEncodeError):
        pass

    try:
        # Attempt to decode as UTF-32
        return raw.encode('latin1').decode('utf-32')
    except (UnicodeDecodeError, UnicodeEncodeError):
        pass

    # If no decoding worked, return the original string (unlikely scenario)
    return s

# Example usage:
emoji_utf16 = "\ud83d\ude0e"  # UTF-16 surrogate pair for 😎 emoji
converted_str = detect_and_convert_encoding(emoji_utf16)
print(converted_str)  # Should print 😎

# Example for UTF-8 input (already encoded emoji)
emoji_utf8 = "😎"
converted_str_utf8 = detect_and_convert_encoding(emoji_utf8)
print(converted_str_utf8)  # Should print 😎

# Example for combined emoji (zero-width-joiner sequence)
emoji_combined = "\ud83d\udc69\u200d\ud83c\udfeb"  # Female teacher emoji
converted_combined_str = detect_and_convert_encoding(emoji_combined)
print(converted_combined_str)  # Should print 👩‍🏫
Explanation:
The function first expands any escape sequences with the 'unicode-escape' codec and tries to interpret the resulting bytes as UTF-8.
If UTF-8 decoding fails, it attempts UTF-16 with the 'surrogatepass' error handler, which is what allows surrogate pairs (common with emoji) to be processed.
If UTF-16 fails, it attempts to decode as UTF-32.
If every attempt fails, the function returns the original string unchanged; otherwise it returns the decoded, printable string.
Why this works:
Emoji that arrive as surrogate pairs, including multi-code-point sequences joined with a zero-width joiner (such as the teacher emoji), are decoded according to the input encoding.
The function accepts input that is escaped or encoded in different formats and aims for a consistent, printable Unicode result.
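For reference, the surrogate-pair arithmetic can also be done by hand. The sketch below is an alternative under stated assumptions, not the function above: it assumes the text contains literal escape text (a backslash, the letter u, and four hex digits), and join_escaped_surrogates is a hypothetical helper name of mine.

```python
import re

# Matches a high-surrogate escape (D800-DBFF) followed directly by a
# low-surrogate escape (DC00-DFFF), written as literal "\uXXXX" text.
_PAIR = re.compile(
    r"\\u(d[89ab][0-9a-f]{2})\\u(d[c-f][0-9a-f]{2})",
    re.IGNORECASE,
)

def join_escaped_surrogates(text):
    """Replace escaped UTF-16 surrogate pairs with the real character."""
    def _join(match):
        high = int(match.group(1), 16)
        low = int(match.group(2), 16)
        # Standard UTF-16 formula for code points above U+FFFF
        code_point = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
        return chr(code_point)
    return _PAIR.sub(_join, text)

# Raw string: the backslashes are literal characters here
sample = r"cool \ud83d\ude0e"
fixed = join_escaped_surrogates(sample)
print(ascii(fixed))  # ASCII-safe view of the result
```

It only touches well-formed pairs, so an unpaired escape or any other backslash in the text is left exactly as it was.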
Thank you both for the solutions, much appreciated.
The below code works on my Windows 10 + VS Code setup, and all the emojis are printed correctly in the terminal:
Detect and convert UTF-8, UTF-16, or UTF-32 encoded strings to printable Unicode, especially emoji.
However, I’m still getting characters like \ud83d\ude2c printed in the terminal after receiving the ChatGPT API response. I’ve even tried this on Windows 11 and Mac, but unfortunately there’s no difference.
It is still possible that the AI itself is producing and responding with those as strings, instead of outputting byte tokens that represent the unicode.
I’ve not seen the AI make mistakes in emoji vs escape sequences unless you are asking for code ambiguously.
Logprobs that showed token numbers instead of characters would be incredibly useful for a dozen different cases, such as this.
You might want to run a few lines of chat completions at the Python console, or in the IDLE shell, where you have more IDE-like features in the REPL interface. Then, from the response object’s .message.content, inspect what was actually sent to you.
You can also just break into your code if it is not within a function, or set the response as a global, to examine the state.
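To tell the two cases apart once you have the content string in hand, something like the following could help. It is only a sketch: inspect_text is a hypothetical helper of mine, and it needs no API call, so you can paste in whatever string you captured.

```python
import re

# Literal escape text: a backslash, "u", and four hex digits as characters
_ESCAPE_TEXT = re.compile(r"\\u[0-9a-fA-F]{4}")

def inspect_text(text):
    """Summarize what a string really contains, in ASCII-safe form."""
    return {
        # ascii() never fails to print, even for lone surrogates
        "ascii_repr": ascii(text),
        # True if the model (or some layer) sent the escape as plain text
        "literal_escapes": bool(_ESCAPE_TEXT.search(text)),
        # True if the str holds surrogate code points rather than real emoji
        "surrogate_code_points": any(0xD800 <= ord(ch) <= 0xDFFF for ch in text),
        "code_points": [f"U+{ord(ch):04X}" for ch in text],
    }

print(inspect_text("\U0001F60E"))      # a real emoji
print(inspect_text(r"\ud83d\ude0e"))   # twelve plain ASCII characters
```

If literal_escapes is true, the escape text itself is in the response; if surrogate_code_points is true, the string was decoded somewhere without the pairs being joined.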
This is just me typing from memory in Python, to demonstrate what happens when I specifically ask for both the emoji and the string. Then do some dirty debugging of the AI output, as you might when exploring.
import openai, json, pydantic, tiktoken
cl = openai.Client()
r = cl.chat.completions.create(model="gpt-4", messages=[{"role": "user", "content": "Your purpose is to output test characters and nothing else. Reproduce the rendered emoji for a smiley face. On the next line, output the \\u python unicode escape sequence that would produce that emoji if placed in a string within code."}], max_tokens=40)
rx = r.choices[0].message.content.encode('utf-8', errors='strict')