Markdown is 15% more token efficient than JSON

That may hold for RLHF and supervised fine-tuning data, but the "native language" that a BPE token encoder is tuned and optimized for is its training corpus.

Heavily formatted Markdown vs. minimized HTML: 309 tokens vs. 546 tokens.
(Measured on a how-to-use-Markdown demonstration document.)

The o200k encoding is less efficient on both the English Markdown and the HTML.
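For anyone who wants to reproduce this kind of comparison, here is a minimal sketch. It assumes the `tiktoken` package and its `cl100k_base` and `o200k_base` encodings; the post does not say which tool or encoder produced 309 and 546, so that pairing is a guess. The helper names (`count_tokens`, `relative_overhead`, `tiktoken_encoder`, `demo`) and the two short sample strings are made up for illustration and are not the documents measured above.

```python
from typing import Callable, Dict, List


def count_tokens(texts: Dict[str, str],
                 encode: Callable[[str], List[int]]) -> Dict[str, int]:
    """Return the token count of each named text under one encoder."""
    return {name: len(encode(text)) for name, text in texts.items()}


def relative_overhead(baseline: int, other: int) -> float:
    """Fraction of extra tokens `other` costs relative to `baseline`."""
    return (other - baseline) / baseline


def tiktoken_encoder(name: str) -> Callable[[str], List[int]]:
    """Load a tiktoken encoding lazily (assumes `pip install tiktoken`)."""
    import tiktoken  # third-party; may fetch the vocabulary on first use
    return tiktoken.get_encoding(name).encode


def demo() -> None:
    """Compare two illustrative snippets under both encodings."""
    samples = {
        "markdown": "**Bold text** and a [link](https://www.openai.com)",
        "html": '<p><strong>Bold text</strong> and a '
                '<a href="https://www.openai.com">link</a></p>',
    }
    for enc_name in ("cl100k_base", "o200k_base"):
        counts = count_tokens(samples, tiktoken_encoder(enc_name))
        extra = relative_overhead(counts["markdown"], counts["html"])
        print(f"{enc_name}: {counts} (HTML vs Markdown: {extra:+.0%})")
```

Call `demo()` to print counts for the two snippets, or pass the full documents below to `count_tokens`. Plugging in the figures quoted above, `relative_overhead(309, 546)` is about 0.77, so the minimized HTML costs roughly 77% more tokens than the Markdown.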

Here is the example document in Markdown, then in HTML:

Markdown Example Document

# My Markdown Document

## Introduction

Welcome to this example document. It demonstrates various Markdown formatting types.

### Formatting

**Bold text** can be created using double asterisks or double underscores.

*Italic text* can be created using single asterisks or single underscores.

~~Strikethrough text~~ is also possible using double tildes.

### Lists

#### Unordered List

- Item 1
- Item 2
  - Subitem 2.1
  - Subitem 2.2
- Item 3

#### Ordered List

1. First item
2. Second item
   1. Subitem 2.1
   2. Subitem 2.2
3. Third item

### Links and Images

[OpenAI](https://www.openai.com)

![OpenAI Logo](https://www.openai.com/assets/images/openai-logo.svg)

### Code

Inline code: `print("Hello, World!")`

Block of code:

\```python
def greet():
    print("Hello, World!")
\```

### Blockquotes

> This is a blockquote. It can span multiple lines.

### Tables

| Header 1 | Header 2 |
| -------- | -------- |
| Row 1    | Data 1   |
| Row 2    | Data 2   |

### Horizontal Rule

---

### Task List

- [x] Task 1
- [ ] Task 2
- [ ] Task 3

And here is the HTML version of the same document:

HTML Version

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>My Markdown Document</title>
</head>
<body>

<h1>My Markdown Document</h1>

<h2>Introduction</h2>

<p>Welcome to this example document. It demonstrates various Markdown formatting types.</p>

<h3>Formatting</h3>

<p><strong>Bold text</strong> can be created using double asterisks or double underscores.</p>

<p><em>Italic text</em> can be created using single asterisks or single underscores.</p>

<p><del>Strikethrough text</del> is also possible using double tildes.</p>

<h3>Lists</h3>

<h4>Unordered List</h4>
<ul>
    <li>Item 1</li>
    <li>Item 2
        <ul>
            <li>Subitem 2.1</li>
            <li>Subitem 2.2</li>
        </ul>
    </li>
    <li>Item 3</li>
</ul>

<h4>Ordered List</h4>
<ol>
    <li>First item</li>
    <li>Second item
        <ol>
            <li>Subitem 2.1</li>
            <li>Subitem 2.2</li>
        </ol>
    </li>
    <li>Third item</li>
</ol>

<h3>Links and Images</h3>

<p><a href="https://www.openai.com">OpenAI</a></p>

<p><img src="https://www.openai.com/assets/images/openai-logo.svg" alt="OpenAI Logo"></p>

<h3>Code</h3>

<p>Inline code: <code>print("Hello, World!")</code></p>

<p>Block of code:</p>

<pre><code>def greet():
    print("Hello, World!")
</code></pre>

<h3>Blockquotes</h3>

<blockquote>
    <p>This is a blockquote. It can span multiple lines.</p>
</blockquote>

<h3>Tables</h3>

<table>
    <thead>
        <tr>
            <th>Header 1</th>
            <th>Header 2</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>Row 1</td>
            <td>Data 1</td>
        </tr>
        <tr>
            <td>Row 2</td>
            <td>Data 2</td>
        </tr>
    </tbody>
</table>

<h3>Horizontal Rule</h3>

<hr>

<h3>Task List</h3>

<ul>
    <li>[x] Task 1</li>
    <li>[ ] Task 2</li>
    <li>[ ] Task 3</li>
</ul>

</body>
</html>

Minimized HTML

<h1>My Markdown Document</h1><h2>Introduction</h2><p>Welcome to this example document. It demonstrates various Markdown formatting types.</p><h3>Formatting</h3><p><strong>Bold text</strong> can be created using double asterisks or double underscores.</p><p><em>Italic text</em> can be created using single asterisks or single underscores.</p><p><del>Strikethrough text</del> is also possible using double tildes.</p><h3>Lists</h3><h4>Unordered List</h4><ul> <li>Item 1</li> <li>Item 2 <ul> <li>Subitem 2.1</li><li>Subitem 2.2</li> </ul> </li> <li>Item 3</li></ul><h4>Ordered List</h4><ol> <li>First item</li><li>Second item <ol> <li>Subitem 2.1</li> <li>Subitem 2.2</li> </ol> </li> <li>Third item</li></ol><h3>Links and Images</h3><p><a href="https://www.openai.com">OpenAI</a></p><p><img src="https://www.openai.com/assets/images/openai-logo.svg" alt="OpenAI Logo"></p><h3>Code</h3><p>Inline code: <code>print("Hello, World!")</code></p><p>Block of code:</p><pre><code>def greet(): print("Hello, World!")</code></pre><h3>Blockquotes</h3><blockquote> <p>This is a blockquote. It can span multiple lines.</p></blockquote><h3>Tables</h3><table> <thead> <tr> <th>Header 1</th> <th>Header 2</th> </tr> </thead> <tbody> <tr> <td>Row 1</td> <td>Data 1</td> </tr> <tr> <td>Row 2</td> <td>Data 2</td> </tr> </tbody></table><h3>Horizontal Rule</h3><hr><h3>Task List</h3><ul> <li>[x] Task 1</li> <li>[ ] Task 2</li> <li>[ ] Task 3</li></ul>