Glitch Tokens in GPT-4o: Seeking Clarification

I first noticed a potential issue with the GPT-4o ‘o200k_base’ tokenizer through discussions on Twitter. Some tokens related to adult content and gambling appear to act as glitch tokens, meaning they cause unexpected model responses. Here’s a summary of my findings and questions:

Identified Issues:

  1. NSFW Tokens: The tokenizer includes tokens like ‘출장안마’ (escort service), ‘바카라’ (baccarat), and ‘출장샵’ (escort club) in Korean. Similar cases have been reported among Chinese and Japanese tokens. These are not inherently problematic, but I’m curious about the rationale behind their inclusion.
  2. Glitch Tokens: Some tokens, especially in Chinese, cause the model to return irrelevant answers. For example:
    Chinese token 114900 (最新高清无码, “latest uncensored HD content”) prompts unrelated responses on topics such as dream analysis, fabric types, or continuing education.
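
For anyone who wants to check these tokens locally, here is a minimal sketch of how the vocabulary could be scanned for long Korean, Chinese, or Japanese tokens. It is not the script linked in the references. The helper names (`is_cjk_or_hangul`, `decode_token`, `long_cjk_tokens`, `inspect_o200k`), the Unicode ranges, and the 4-character threshold are my own assumptions for illustration. The tokenizer calls assume the third-party `tiktoken` package is installed and can load ‘o200k_base’.

```python
from typing import Callable, Iterator, Optional, Tuple


def is_cjk_or_hangul(ch: str) -> bool:
    """True for Hangul syllables, CJK unified ideographs, and kana (assumed ranges)."""
    code = ord(ch)
    return (
        0xAC00 <= code <= 0xD7A3      # Hangul syllables
        or 0x4E00 <= code <= 0x9FFF   # CJK unified ideographs
        or 0x3040 <= code <= 0x30FF   # Hiragana and Katakana
    )


def decode_token(decode_bytes: Callable[[int], bytes], token_id: int) -> Optional[str]:
    """Decode one token ID to text; None if the ID is unused or not valid UTF-8 alone."""
    try:
        return decode_bytes(token_id).decode("utf-8")
    except (KeyError, UnicodeDecodeError):
        return None


def long_cjk_tokens(
    decode_bytes: Callable[[int], bytes], n_vocab: int, min_chars: int = 4
) -> Iterator[Tuple[int, str]]:
    """Yield (token_id, text) for tokens made only of CJK/Hangul characters."""
    for token_id in range(n_vocab):
        text = decode_token(decode_bytes, token_id)
        if text is None:
            continue
        stripped = text.strip()  # many tokens carry a leading space
        if len(stripped) >= min_chars and all(is_cjk_or_hangul(c) for c in stripped):
            yield token_id, text


def inspect_o200k() -> None:
    """Hypothetical entry point; requires `pip install tiktoken`."""
    import tiktoken

    enc = tiktoken.get_encoding("o200k_base")
    # Look up the token ID mentioned in this post.
    print(114900, repr(decode_token(enc.decode_single_token_bytes, 114900)))
    # Check how many tokens one of the Korean strings is split into.
    print(len(enc.encode("출장안마")))
    # List every long CJK/Hangul token in the vocabulary.
    for token_id, text in long_cjk_tokens(enc.decode_single_token_bytes, enc.n_vocab):
        print(token_id, repr(text))
```

Calling `inspect_o200k()` prints the decoded text for ID 114900, the token count for ‘출장안마’, and then the full list of candidates. A scan like this only surfaces suspicious vocabulary entries. Whether a given token actually behaves as a glitch token still has to be tested against the model itself.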

Request for Feedback:

  1. Has anyone else encountered similar issues with the GPT-4o tokenizer?
  2. Can someone from the OpenAI team clarify if including these specific NSFW tokens was intentional?

References and Additional Resources:

  1. Original issue seen by @suchenzang and @_aixile on Twitter (x.com).
  2. Script for Token Inspection: tokenizer/gpt4o_tokenizer_inspection.py at master · simpleusername96/tokenizer · GitHub
  3. Glitch Tokens Article: https://www.lesswrong.com/posts/aPeJE8bSo6rAFoLqg/solidgoldmagikarp-plus-prompt-generation
  4. Glitch Tokens Example 1: https://chat.openai.com/share/9f434a84-b316-4114-a185-c8c7ebc1b496
  5. Glitch Tokens Example 2: https://chat.openai.com/share/a120e88a-eb66-4639-96f9-f1574f7577ea
  6. Glitch Tokens Example 3: https://chat.openai.com/share/d80d349d-315f-49b7-b84b-c5eb3d35d9dc

Thank you for your attention. I look forward to any insights or feedback!