Markdown is 15% more token efficient than JSON

ashdragon · June 26, 2024, 9:01pm

I took a large-ish JSON file and converted it to TOML, YAML, and Markdown, and then counted tokens using tiktoken

Original JSON: 13,869 tokens
TOML: 12,503 tokens
YAML: 12,333 tokens
Markdown: 11,612 tokens
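
For reference, the relative savings implied by these counts can be worked out with a few lines of Python. This is only arithmetic on the numbers reported above; the helper name `savings_vs_baseline` is made up for illustration.

```python
# Token counts as reported in the post above.
counts = {"JSON": 13_869, "TOML": 12_503, "YAML": 12_333, "Markdown": 11_612}

def savings_vs_baseline(counts, baseline="JSON"):
    """Return the percentage of tokens each format saves relative to the baseline."""
    base = counts[baseline]
    return {
        name: round(100 * (base - n) / base, 1)
        for name, n in counts.items()
        if name != baseline
    }

savings = savings_vs_baseline(counts)
for name, pct in savings.items():
    print(f"{name}: {pct}% fewer tokens than JSON")
```

On these numbers Markdown comes out at roughly 16% fewer tokens than JSON, in line with the "15%" in the title; TOML and YAML save roughly 10% and 11%.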

In the project I’m working on right now, I wind up having to send the same chunk of data several times to split the response up, because it exceeds the maximum limit for a response (which is different from the max_tokens parameter). To get those 13,869 output tokens, I sent 92,945 tokens, but from only 58,473 tokens’ worth of prompts! Converting to Markdown could save me 20-30% overall.

Diet · June 26, 2024, 9:05pm

Cool idea! It’s a tiny optimization, but might make a real difference if you got tons of volume. Thanks for sharing!

One reservation I have would be that it probably doesn’t work for everything. But definitely warrants further investigation! Do you have examples of the docs you converted?

anon22939549 · June 27, 2024, 12:43am

Yeah, I think we’ve discussed this around here before.

At least certainly in the context of Markdown vs \LaTeX.

Markdown does sacrifice quite a bit of flexibility in exchange for being substantially more compact.

Markdown is also the “native” language of most LLMs, so it will tend to tokenize better too.

_j · June 27, 2024, 3:29am

While that may be true for RLHF and supervised tuning, the “native language” on which the BPE token encoder is tuned and optimized is the training corpus.

Intensely formatted Markdown => minimized HTML: 309 to 546 tokens.
(a how-to-use-markdown demonstration document)

o200k is less efficient on both the English markdown and the HTML.
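
A comparison like this can be reproduced with a small helper that counts tokens for several renderings of the same document. The sketch below is an assumption about how one might structure it, not the script used for the numbers above: `count_tokens`, `compare_formats`, and `toy_encode` are hypothetical names, and `toy_encode` is a stand-in splitter, not a BPE encoder.

```python
import re

def count_tokens(text, encode):
    """Count tokens using any callable that maps a string to a sequence of tokens."""
    return len(encode(text))

def compare_formats(docs, encode):
    """Return {format name: token count} for each rendering of the document."""
    return {name: count_tokens(text, encode) for name, text in docs.items()}

# Stand-in tokenizer so the sketch runs without extra packages: it splits text
# into words and single punctuation characters. It is NOT a real BPE encoder.
def toy_encode(text):
    return re.findall(r"\w+|[^\w\s]", text)

docs = {
    "markdown": "**Bold text**",
    "html": "<p><strong>Bold text</strong></p>",
}
toy_counts = compare_formats(docs, toy_encode)
print(toy_counts)
```

With tiktoken installed, `tiktoken.get_encoding("cl100k_base").encode` or `tiktoken.get_encoding("o200k_base").encode` can be passed as `encode` to get real counts for each encoder.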

Here is an example document in Markdown, then HTML

Markdown Example Document

# My Markdown Document

## Introduction

Welcome to this example document. It demonstrates various Markdown formatting types.

### Formatting

**Bold text** can be created using double asterisks or double underscores.

*Italic text* can be created using single asterisks or single underscores.

~~Strikethrough text~~ is also possible using double tildes.

### Lists

#### Unordered List

- Item 1
- Item 2
  - Subitem 2.1
  - Subitem 2.2
- Item 3

#### Ordered List

1. First item
2. Second item
   1. Subitem 2.1
   2. Subitem 2.2
3. Third item

### Links and Images

[OpenAI](https://www.openai.com)

![OpenAI Logo](https://www.openai.com/assets/images/openai-logo.svg)

### Code

Inline code: `print("Hello, World!")`

Block of code:

\```python
def greet():
    print("Hello, World!")
\```

### Blockquotes

> This is a blockquote. It can span multiple lines.

### Tables

| Header 1 | Header 2 |
| -------- | -------- |
| Row 1    | Data 1   |
| Row 2    | Data 2   |

### Horizontal Rule

---

### Task List

- [x] Task 1
- [ ] Task 2
- [ ] Task 3

And here is the HTML version of the same document:

HTML Version

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>My Markdown Document</title>
</head>
<body>

<h1>My Markdown Document</h1>

<h2>Introduction</h2>

<p>Welcome to this example document. It demonstrates various Markdown formatting types.</p>

<h3>Formatting</h3>

<p><strong>Bold text</strong> can be created using double asterisks or double underscores.</p>

<p><em>Italic text</em> can be created using single asterisks or single underscores.</p>

<p><del>Strikethrough text</del> is also possible using double tildes.</p>

<h3>Lists</h3>

<h4>Unordered List</h4>
<ul>
    <li>Item 1</li>
    <li>Item 2
        <ul>
            <li>Subitem 2.1</li>
            <li>Subitem 2.2</li>
        </ul>
    </li>
    <li>Item 3</li>
</ul>

<h4>Ordered List</h4>
<ol>
    <li>First item</li>
    <li>Second item
        <ol>
            <li>Subitem 2.1</li>
            <li>Subitem 2.2</li>
        </ol>
    </li>
    <li>Third item</li>
</ol>

<h3>Links and Images</h3>

<p><a href="https://www.openai.com">OpenAI</a></p>

<p><img src="https://www.openai.com/assets/images/openai-logo.svg" alt="OpenAI Logo"></p>

<h3>Code</h3>

<p>Inline code: <code>print("Hello, World!")</code></p>

<p>Block of code:</p>

<pre><code>def greet():
    print("Hello, World!")
</code></pre>

<h3>Blockquotes</h3>

<blockquote>
    <p>This is a blockquote. It can span multiple lines.</p>
</blockquote>

<h3>Tables</h3>

<table>
    <thead>
        <tr>
            <th>Header 1</th>
            <th>Header 2</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>Row 1</td>
            <td>Data 1</td>
        </tr>
        <tr>
            <td>Row 2</td>
            <td>Data 2</td>
        </tr>
    </tbody>
</table>

<h3>Horizontal Rule</h3>

<hr>

<h3>Task List</h3>

<ul>
    <li>[x] Task 1</li>
    <li>[ ] Task 2</li>
    <li>[ ] Task 3</li>
</ul>

</body>
</html>

Minimized HTML

<h1>My Markdown Document</h1><h2>Introduction</h2><p>Welcome to this example document. It demonstrates various Markdown formatting types.</p><h3>Formatting</h3><p><strong>Bold text</strong> can be created using double asterisks or double underscores.</p><p><em>Italic text</em> can be created using single asterisks or single underscores.</p><p><del>Strikethrough text</del> is also possible using double tildes.</p><h3>Lists</h3><h4>Unordered List</h4><ul> <li>Item 1</li> <li>Item 2 <ul> <li>Subitem 2.1</li><li>Subitem 2.2</li> </ul> </li> <li>Item 3</li></ul><h4>Ordered List</h4><ol> <li>First item</li><li>Second item <ol> <li>Subitem 2.1</li> <li>Subitem 2.2</li> </ol> </li> <li>Third item</li></ol><h3>Links and Images</h3><p><a href="https://www.openai.com">OpenAI</a></p><p><img src="https://www.openai.com/assets/images/openai-logo.svg" alt="OpenAI Logo"></p><h3>Code</h3><p>Inline code: <code>print("Hello, World!")</code></p><p>Block of code:</p><pre><code>def greet(): print("Hello, World!")</code></pre><h3>Blockquotes</h3><blockquote> <p>This is a blockquote. It can span multiple lines.</p></blockquote><h3>Tables</h3><table> <thead> <tr> <th>Header 1</th> <th>Header 2</th> </tr> </thead> <tbody> <tr> <td>Row 1</td> <td>Data 1</td> </tr> <tr> <td>Row 2</td> <td>Data 2</td> </tr> </tbody></table><h3>Horizontal Rule</h3><hr><h3>Task List</h3><ul> <li>[x] Task 1</li> <li>[ ] Task 2</li> <li>[ ] Task 3</li></ul>

hagen · June 27, 2024, 9:11am

Thanks for sharing! What kind of format did you use for the Markdown instead of JSON? How did you manage to keep a key-value scheme that way?

ashdragon · June 27, 2024, 7:07pm

I’ve been generating nested JSON objects that have arrays at the leaf nodes at varying levels, over the course of multiple generations. For the Markdown, I’m using headings to indicate keys, and the level of heading to indicate the level of nesting. And then the arrays are Markdown lists.
So, this:

{
  "Characters": {
    "Aide de camp": {
      "4": {
        "Personality": ["Disdainful"],
        "Mood": ["Disdainful"],
        "Relationships": [
                "Subordinate to Authand",
                "Interacts with Kalia"
        ]
      }
    }
  }
}

becomes:

# Characters
## Aide de camp
### 4
#### Personality
- Disdainful
#### Mood
- Disdainful
#### Relationships
- Subordinate to Authand
- Interacts with Kalia
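
As a rough sketch of that mapping (not the actual code from this project), nested dicts can be walked recursively, emitting one heading per key and one bullet per array item. The function name `to_markdown` is hypothetical, and the sketch assumes every leaf is a list of strings and that nesting stays within Markdown's six heading levels.

```python
def _render(data, level):
    """Yield Markdown lines: one heading per key, one bullet per list item."""
    for key, value in data.items():
        yield f"{'#' * level} {key}"
        if isinstance(value, dict):
            # Nested object: recurse one heading level deeper.
            yield from _render(value, level + 1)
        else:
            # Leaf array (assumed to be a list of strings).
            for item in value:
                yield f"- {item}"

def to_markdown(data):
    """Convert nested dicts with list leaves into heading/bullet Markdown."""
    return "\n".join(_render(data, 1))

data = {
    "Characters": {
        "Aide de camp": {
            "4": {
                "Personality": ["Disdainful"],
                "Mood": ["Disdainful"],
                "Relationships": ["Subordinate to Authand", "Interacts with Kalia"],
            }
        }
    }
}
print(to_markdown(data))
```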

And then I made a converter to turn it into JSON afterward. Bonus points: it’s a lot easier to combine multiple chunks of Markdown text than multiple chunks of stringified JSON.
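
The converter itself isn't shown in the thread, so the following is only a minimal sketch of how the reverse direction could work under the same scheme. `from_markdown` is a hypothetical name; it assumes well-formed input where a heading holds either sub-headings or bullets, never both.

```python
import json

def from_markdown(text):
    """Parse heading/bullet Markdown back into nested dicts with list leaves."""
    root = {}
    stack = [root]            # stack[i] is the dict that holds level i + 1 headings
    parent, key = None, None  # location of the most recent heading
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            hashes, _, title = line.partition(" ")
            level = len(hashes)
            del stack[level:]          # close any deeper or sibling sections
            parent, key = stack[-1], title.strip()
            parent[key] = {}           # assume an object until a bullet appears
            stack.append(parent[key])
        elif line.startswith("- ") and parent is not None:
            if isinstance(parent[key], dict):
                parent[key] = []       # first bullet: this heading is a leaf array
            parent[key].append(line[2:])
    return root

markdown_text = """# Characters
## Aide de camp
### 4
#### Personality
- Disdainful
#### Relationships
- Subordinate to Authand
- Interacts with Kalia
"""
parsed = from_markdown(markdown_text)
print(json.dumps(parsed, indent=2))
```

Note that this sketch overwrites a key if the same heading appears twice at the same level, so merging several chunks would need extra handling.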

ashdragon · June 27, 2024, 7:11pm

I’m using headings as keys, with the heading level to indicate nesting. Then whatever is under the heading is the value. I still have to define a schema to use, and not mix data types in a single value, because I’m converting it to JSON after, but it works pretty well so far

btfranklin · June 27, 2024, 7:32pm

This is a very interesting discovery! I personally have been finding Markdown to be my format of choice for most of my more complex LLM interactions, but this gives a nice concrete basis to justify that choice.

Somewhat tangentially-related, you might find use in my “promptdown” package which defines a specialized way of expressing prompt templates in a markdown sub-format:

I’ve been using that format to externalize re-usable prompt templates, and I’m finding it a lot easier to read than JSON for this purpose.

hagen · June 27, 2024, 8:11pm

Markdown could definitely fix the “broken JSON” issue, where double quotes still sometimes appear unescaped within a string. In JSON, this breaks any parser. In Markdown it wouldn’t matter at all.
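
A small illustration of that failure mode, using Python's standard `json` module. The sample strings are invented for the example; the bullet extraction is a deliberately naive stand-in for a real Markdown parser.

```python
import json

# A model reply with an unescaped double quote inside a string value.
broken_json = '{"Mood": ["He said "no" and left"]}'

try:
    json.loads(broken_json)
    parsed_ok = True
except json.JSONDecodeError:
    # The parser sees the string end at the inner quote and gives up.
    parsed_ok = False

# The same content as Markdown: quotes are just ordinary characters.
markdown = '#### Mood\n- He said "no" and left'
items = [line[2:] for line in markdown.splitlines() if line.startswith("- ")]

print(parsed_ok)
print(items)
```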
