While that may be true for RLHF and supervised tuning, the “native language” for which a BPE token encoder is tuned and optimized is its training corpus.
Intensely formatted Markdown => minimized HTML: 309 tokens to 546 tokens
(measured on a how-to-use-Markdown demonstration document).
o200k is less efficient on both the English markdown and the HTML.
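For anyone who wants to reproduce this kind of comparison, here is a minimal sketch of how the counts could be gathered. It is not the exact script behind the numbers above. The helper names (`count_tokens`, `compare_formats`, `rough_encode`, `tiktoken_encoders`) are made up for illustration, and `cl100k_base` as the encoding compared against o200k is an assumption. `rough_encode` is only a stand-in word/punctuation splitter so the sketch runs without extra packages; it is not a BPE tokenizer and its counts will not match real ones.

```python
import re
from typing import Callable, Dict, List, Sequence


def count_tokens(text: str, encode: Callable[[str], Sequence]) -> int:
    """Number of tokens a given encoder produces for the text."""
    return len(encode(text))


def compare_formats(
    samples: Dict[str, str],
    encoders: Dict[str, Callable[[str], Sequence]],
) -> Dict[str, Dict[str, int]]:
    """Return {encoder_name: {sample_name: token_count}}."""
    return {
        enc_name: {name: count_tokens(text, encode) for name, text in samples.items()}
        for enc_name, encode in encoders.items()
    }


def rough_encode(text: str) -> List[str]:
    """Stand-in splitter (words and single punctuation marks), NOT a BPE tokenizer."""
    return re.findall(r"\w+|[^\w\s]", text)


def tiktoken_encoders(names=("cl100k_base", "o200k_base")):
    """Real BPE encoders via tiktoken (needs `pip install tiktoken`).

    The encoding names are an assumption about which tokenizers to compare.
    """
    import tiktoken  # imported lazily so the rest of the sketch runs without it

    return {name: tiktoken.get_encoding(name).encode for name in names}


samples = {
    "markdown": "# Title\n**Bold text**",
    "html": "<h1>Title</h1><p><strong>Bold text</strong></p>",
}
print(compare_formats(samples, {"rough": rough_encode}))
```

To get real token counts, pass `tiktoken_encoders()` as the second argument instead of the stand-in, and put the full Markdown and HTML documents into `samples`.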
Here is an example document in Markdown, then in HTML:
Markdown Example Document
# My Markdown Document
## Introduction
Welcome to this example document. It demonstrates various Markdown formatting types.
### Formatting
**Bold text** can be created using double asterisks or double underscores.
*Italic text* can be created using single asterisks or single underscores.
~~Strikethrough text~~ is also possible using double tildes.
### Lists
#### Unordered List
- Item 1
- Item 2
    - Subitem 2.1
    - Subitem 2.2
- Item 3
#### Ordered List
1. First item
2. Second item
    1. Subitem 2.1
    2. Subitem 2.2
3. Third item
### Links and Images
[OpenAI](https://www.openai.com)
![OpenAI Logo](https://www.openai.com/assets/images/openai-logo.svg)
### Code
Inline code: `print("Hello, World!")`
Block of code:
\```python
def greet():
    print("Hello, World!")
\```
### Blockquotes
> This is a blockquote. It can span multiple lines.
### Tables
| Header 1 | Header 2 |
| -------- | -------- |
| Row 1 | Data 1 |
| Row 2 | Data 2 |
### Horizontal Rule
---
### Task List
- [x] Task 1
- [ ] Task 2
- [ ] Task 3
And here is the HTML version of the same document:
HTML Version
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>My Markdown Document</title>
</head>
<body>
<h1>My Markdown Document</h1>
<h2>Introduction</h2>
<p>Welcome to this example document. It demonstrates various Markdown formatting types.</p>
<h3>Formatting</h3>
<p><strong>Bold text</strong> can be created using double asterisks or double underscores.</p>
<p><em>Italic text</em> can be created using single asterisks or single underscores.</p>
<p><del>Strikethrough text</del> is also possible using double tildes.</p>
<h3>Lists</h3>
<h4>Unordered List</h4>
<ul>
<li>Item 1</li>
<li>Item 2
<ul>
<li>Subitem 2.1</li>
<li>Subitem 2.2</li>
</ul>
</li>
<li>Item 3</li>
</ul>
<h4>Ordered List</h4>
<ol>
<li>First item</li>
<li>Second item
<ol>
<li>Subitem 2.1</li>
<li>Subitem 2.2</li>
</ol>
</li>
<li>Third item</li>
</ol>
<h3>Links and Images</h3>
<p><a href="https://www.openai.com">OpenAI</a></p>
<p><img src="https://www.openai.com/assets/images/openai-logo.svg" alt="OpenAI Logo"></p>
<h3>Code</h3>
<p>Inline code: <code>print("Hello, World!")</code></p>
<p>Block of code:</p>
<pre><code>def greet():
    print("Hello, World!")
</code></pre>
<h3>Blockquotes</h3>
<blockquote>
<p>This is a blockquote. It can span multiple lines.</p>
</blockquote>
<h3>Tables</h3>
<table>
<thead>
<tr>
<th>Header 1</th>
<th>Header 2</th>
</tr>
</thead>
<tbody>
<tr>
<td>Row 1</td>
<td>Data 1</td>
</tr>
<tr>
<td>Row 2</td>
<td>Data 2</td>
</tr>
</tbody>
</table>
<h3>Horizontal Rule</h3>
<hr>
<h3>Task List</h3>
<ul>
<li>[x] Task 1</li>
<li>[ ] Task 2</li>
<li>[ ] Task 3</li>
</ul>
</body>
</html>
Minimized HTML
<h1>My Markdown Document</h1><h2>Introduction</h2><p>Welcome to this example document. It demonstrates various Markdown formatting types.</p><h3>Formatting</h3><p><strong>Bold text</strong> can be created using double asterisks or double underscores.</p><p><em>Italic text</em> can be created using single asterisks or single underscores.</p><p><del>Strikethrough text</del> is also possible using double tildes.</p><h3>Lists</h3><h4>Unordered List</h4><ul> <li>Item 1</li> <li>Item 2 <ul> <li>Subitem 2.1</li><li>Subitem 2.2</li> </ul> </li> <li>Item 3</li></ul><h4>Ordered List</h4><ol> <li>First item</li><li>Second item <ol> <li>Subitem 2.1</li> <li>Subitem 2.2</li> </ol> </li> <li>Third item</li></ol><h3>Links and Images</h3><p><a href="https://www.openai.com">OpenAI</a></p><p><img src="https://www.openai.com/assets/images/openai-logo.svg" alt="OpenAI Logo"></p><h3>Code</h3><p>Inline code: <code>print("Hello, World!")</code></p><p>Block of code:</p><pre><code>def greet(): print("Hello, World!")</code></pre><h3>Blockquotes</h3><blockquote> <p>This is a blockquote. It can span multiple lines.</p></blockquote><h3>Tables</h3><table> <thead> <tr> <th>Header 1</th> <th>Header 2</th> </tr> </thead> <tbody> <tr> <td>Row 1</td> <td>Data 1</td> </tr> <tr> <td>Row 2</td> <td>Data 2</td> </tr> </tbody></table><h3>Horizontal Rule</h3><hr><h3>Task List</h3><ul> <li>[x] Task 1</li> <li>[ ] Task 2</li> <li>[ ] Task 3</li></ul>
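The minimized version above looks like the body contents with line breaks and indentation collapsed. A rough sketch of that kind of minification follows, assuming a regex approach is good enough for a demo document. The function name `minimize_html` is hypothetical and this is not necessarily how the text above was produced. It is slightly more aggressive than the sample, which still keeps single spaces between some tags, and like the sample it flattens whitespace inside `<pre>`, so it should not be used where preformatted content matters.

```python
import re


def minimize_html(html: str) -> str:
    """Naive minifier: keep the <body> contents and squeeze out whitespace."""
    # Keep only what is inside <body>...</body>, if the wrapper is present.
    match = re.search(r"<body[^>]*>(.*)</body>", html, flags=re.DOTALL | re.IGNORECASE)
    if match:
        html = match.group(1)
    # Collapse every run of whitespace (including newlines) to one space.
    html = re.sub(r"\s+", " ", html)
    # Drop the remaining whitespace that sits directly between two tags.
    html = re.sub(r">\s+<", "><", html)
    return html.strip()


page = """<html>
<body>
<h1>My Markdown Document</h1>
<ul>
<li>Item 1</li>
<li>Item 2</li>
</ul>
</body>
</html>"""
print(minimize_html(page))
```

Feeding the result, the full HTML, and the Markdown source to the same tokenizer gives the three counts being compared here.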