I took a large-ish JSON file, converted it to TOML, YAML, and Markdown, and then counted the tokens in each version using tiktoken:
Original JSON: 13,869 tokens
TOML: 12,503 tokens
YAML: 12,333 tokens
Markdown: 11,612 tokens
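For anyone who wants to repeat the comparison on their own data, here is a minimal sketch of how the counting could be done. The helper names (`tiktoken_counter`, `compare_formats`) are made up for illustration, and `cl100k_base` is only an assumed default, since the post doesn't say which encoding was used.

```python
from typing import Callable, Dict


def tiktoken_counter(encoding_name: str = "cl100k_base") -> Callable[[str], int]:
    """Build a token counter backed by tiktoken.

    The encoding name is an assumption; pass "o200k_base" to compare.
    tiktoken is imported lazily so the rest of the module works without it.
    """
    import tiktoken  # third-party: pip install tiktoken

    encoding = tiktoken.get_encoding(encoding_name)
    return lambda text: len(encoding.encode(text))


def compare_formats(
    documents: Dict[str, str], count: Callable[[str], int]
) -> Dict[str, int]:
    """Return the token count of each serialized document, smallest first."""
    counts = {name: count(text) for name, text in documents.items()}
    return dict(sorted(counts.items(), key=lambda item: item[1]))


# Usage (needs tiktoken installed and its encoding files available):
#   documents = {"json": json_text, "markdown": markdown_text}
#   compare_formats(documents, tiktoken_counter("cl100k_base"))
```

Passing the counter in as a function keeps the comparison independent of any one tokenizer, so the same data can be checked against several encodings.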
In the project I’m working on right now, I wind up having to send the same chunk of data several times to split the response up, because it exceeds the maximum limit for a response (which is different from the max_tokens parameter). To get those 13,869 output tokens, I sent 92,945 tokens, but from only 58,473 tokens’ worth of prompts! Converting to Markdown could save me 20-30% overall.
Cool idea! It’s a tiny optimization, but it might make a real difference if you have tons of volume. Thanks for sharing!
One reservation I have is that it probably doesn’t work for everything. But it definitely warrants further investigation! Do you have examples of the docs you converted?
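Just to put numbers on the figures quoted above, here is the plain arithmetic as a small sketch. It reads "92,945 sent from 58,473 worth of prompts" as total tokens sent versus distinct prompt content, which is my interpretation. It does not try to reproduce the 20-30% overall estimate, since that depends on how much of the resent data is the converted payload.

```python
# Token counts quoted in the post.
JSON_TOKENS = 13_869
FORMAT_TOKENS = {"TOML": 12_503, "YAML": 12_333, "Markdown": 11_612}


def percent_saved(baseline: int, candidate: int) -> float:
    """Percentage of tokens saved by `candidate` relative to `baseline`."""
    return 100.0 * (baseline - candidate) / baseline


# Saving per copy of the data, relative to the original JSON.
savings = {
    name: round(percent_saved(JSON_TOKENS, tokens), 1)
    for name, tokens in FORMAT_TOKENS.items()
}

# Assumed reading: 92,945 tokens sent in total, 58,473 of them distinct.
SENT_TOKENS = 92_945
DISTINCT_PROMPT_TOKENS = 58_473
resent_tokens = SENT_TOKENS - DISTINCT_PROMPT_TOKENS
resent_share = round(100.0 * resent_tokens / SENT_TOKENS, 1)

for name, saved in savings.items():
    print(f"{name}: {saved}% fewer tokens than JSON")
print(f"Resent: {resent_tokens} tokens ({resent_share}% of everything sent)")
```

Under that reading, a single copy of the data shrinks by roughly 10% (TOML), 11% (YAML), and 16% (Markdown), and a bit over a third of everything sent was repeated content.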
While that may be true for RLHF and supervised tuning, the “native language” on which a BPE token encoder is tuned and optimized is its training corpus.
Intensely formatted Markdown vs. minimized HTML: 309 tokens vs. 546 tokens.
(a how-to-use-markdown demonstration document)
o200k is less efficient on both the English markdown and the HTML.
Here is an example document in Markdown, then HTML
Markdown Example Document
# My Markdown Document
## Introduction
Welcome to this example document. It demonstrates various Markdown formatting types.
### Formatting
**Bold text** can be created using double asterisks or double underscores.
*Italic text* can be created using single asterisks or single underscores.
~~Strikethrough text~~ is also possible using double tildes.
### Lists
#### Unordered List
- Item 1
- Item 2
  - Subitem 2.1
  - Subitem 2.2
- Item 3
#### Ordered List
1. First item
2. Second item
   1. Subitem 2.1
   2. Subitem 2.2
3. Third item
### Links and Images
[OpenAI](https://www.openai.com)
![OpenAI Logo](https://www.openai.com/assets/images/openai-logo.svg)
### Code
Inline code: `print("Hello, World!")`
Block of code:
\```python
def greet():
    print("Hello, World!")
\```
### Blockquotes
> This is a blockquote. It can span multiple lines.
### Tables
| Header 1 | Header 2 |
| -------- | -------- |
| Row 1 | Data 1 |
| Row 2 | Data 2 |
### Horizontal Rule
---
### Task List
- [x] Task 1
- [ ] Task 2
- [ ] Task 3
And here is the HTML version of the same document:
HTML Version
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>My Markdown Document</title>
</head>
<body>
<h1>My Markdown Document</h1>
<h2>Introduction</h2>
<p>Welcome to this example document. It demonstrates various Markdown formatting types.</p>
<h3>Formatting</h3>
<p><strong>Bold text</strong> can be created using double asterisks or double underscores.</p>
<p><em>Italic text</em> can be created using single asterisks or single underscores.</p>
<p><del>Strikethrough text</del> is also possible using double tildes.</p>
<h3>Lists</h3>
<h4>Unordered List</h4>
<ul>
<li>Item 1</li>
<li>Item 2
<ul>
<li>Subitem 2.1</li>
<li>Subitem 2.2</li>
</ul>
</li>
<li>Item 3</li>
</ul>
<h4>Ordered List</h4>
<ol>
<li>First item</li>
<li>Second item
<ol>
<li>Subitem 2.1</li>
<li>Subitem 2.2</li>
</ol>
</li>
<li>Third item</li>
</ol>
<h3>Links and Images</h3>
<p><a href="https://www.openai.com">OpenAI</a></p>
<p><img src="https://www.openai.com/assets/images/openai-logo.svg" alt="OpenAI Logo"></p>
<h3>Code</h3>
<p>Inline code: <code>print("Hello, World!")</code></p>
<p>Block of code:</p>
<pre><code>def greet():
    print("Hello, World!")
</code></pre>
<h3>Blockquotes</h3>
<blockquote>
<p>This is a blockquote. It can span multiple lines.</p>
</blockquote>
<h3>Tables</h3>
<table>
<thead>
<tr>
<th>Header 1</th>
<th>Header 2</th>
</tr>
</thead>
<tbody>
<tr>
<td>Row 1</td>
<td>Data 1</td>
</tr>
<tr>
<td>Row 2</td>
<td>Data 2</td>
</tr>
</tbody>
</table>
<h3>Horizontal Rule</h3>
<hr>
<h3>Task List</h3>
<ul>
<li>[x] Task 1</li>
<li>[ ] Task 2</li>
<li>[ ] Task 3</li>
</ul>
</body>
</html>
Minimized HTML
<h1>My Markdown Document</h1><h2>Introduction</h2><p>Welcome to this example document. It demonstrates various Markdown formatting types.</p><h3>Formatting</h3><p><strong>Bold text</strong> can be created using double asterisks or double underscores.</p><p><em>Italic text</em> can be created using single asterisks or single underscores.</p><p><del>Strikethrough text</del> is also possible using double tildes.</p><h3>Lists</h3><h4>Unordered List</h4><ul> <li>Item 1</li> <li>Item 2 <ul> <li>Subitem 2.1</li><li>Subitem 2.2</li> </ul> </li> <li>Item 3</li></ul><h4>Ordered List</h4><ol> <li>First item</li><li>Second item <ol> <li>Subitem 2.1</li> <li>Subitem 2.2</li> </ol> </li> <li>Third item</li></ol><h3>Links and Images</h3><p><a href="https://www.openai.com">OpenAI</a></p><p><img src="https://www.openai.com/assets/images/openai-logo.svg" alt="OpenAI Logo"></p><h3>Code</h3><p>Inline code: <code>print("Hello, World!")</code></p><p>Block of code:</p><pre><code>def greet(): print("Hello, World!")</code></pre><h3>Blockquotes</h3><blockquote> <p>This is a blockquote. It can span multiple lines.</p></blockquote><h3>Tables</h3><table> <thead> <tr> <th>Header 1</th> <th>Header 2</th> </tr> </thead> <tbody> <tr> <td>Row 1</td> <td>Data 1</td> </tr> <tr> <td>Row 2</td> <td>Data 2</td> </tr> </tbody></table><h3>Horizontal Rule</h3><hr><h3>Task List</h3><ul> <li>[x] Task 1</li> <li>[ ] Task 2</li> <li>[ ] Task 3</li></ul>
Thanks for sharing! What kind of format did you use when switching from JSON to Markdown? How do you still manage to keep a key-value scheme that way?
I’ve been generating nested JSON objects that have arrays at the leaf nodes at varying levels, over the course of multiple generations. For the Markdown, I’m using headings to indicate keys, and the level of heading to indicate the level of nesting. And then the arrays are Markdown lists.
So, this:
{
  "Characters": {
    "Aide de camp": {
      "4": {
        "Personality": ["Disdainful"],
        "Mood": ["Disdainful"],
        "Relationships": [
          "Subordinate to Authand",
          "Interacts with Kalia"
        ]
      }
    }
  }
}
becomes:
# Characters
## Aide de camp
### 4
#### Personality
- Disdainful
#### Mood
- Disdainful
#### Relationships
- Subordinate to Authand
- Interacts with Kalia
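The mapping shown above (keys become headings, nesting becomes heading level, arrays become lists) can be sketched roughly like this. It is not the actual code from the project, just an illustration; `to_markdown` is a made-up name, and it assumes every leaf is a list of simple values.

```python
from typing import Any, List


def to_markdown(value: Any, level: int = 1) -> str:
    """Render nested dicts as headings and leaf lists as bullet lists."""
    lines: List[str] = []
    if isinstance(value, dict):
        for key, child in value.items():
            # The heading depth encodes how deeply the key is nested.
            lines.append(f"{'#' * level} {key}")
            lines.append(to_markdown(child, level + 1))
    elif isinstance(value, list):
        lines.extend(f"- {item}" for item in value)
    else:
        # Assumption of this sketch: leaves are always arrays.
        raise TypeError(f"unsupported leaf type: {type(value).__name__}")
    return "\n".join(line for line in lines if line)
```

Calling `to_markdown` on the JSON object above reproduces the Markdown listing. The sketch does not cap the heading level, so data nested more than six levels deep would produce lines that standard Markdown no longer treats as headings.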
And then I made a converter to turn it into JSON afterward. Bonus points: it’s a lot easier to combine multiple chunks of Markdown text than it is to combine multiple chunks of stringified JSON.
I’m using headings as keys, with the heading level to indicate nesting. Then whatever is under the heading is the value. I still have to define a schema to use, and not mix data types in a single value, because I’m converting it to JSON afterward, but it works pretty well so far.
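A converter going the other way could look something like the sketch below. Again, this is an assumption about how such a converter might work rather than the one actually used; `from_markdown` is a hypothetical name. Repeated headings are merged, so chunks can be combined by simply concatenating the text before parsing.

```python
import re
from typing import Any, Dict, List

HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")
BULLET = re.compile(r"^\s*[-*]\s+(.*\S)\s*$")


def from_markdown(text: str) -> Dict[str, Any]:
    """Parse heading/list Markdown back into nested dicts with list leaves."""
    root: Dict[str, Any] = {}
    # stack[i] is the dict that holds headings of level i + 1.
    stack: List[Dict[str, Any]] = [root]
    parent, key = root, None  # location of the most recent heading
    for line in text.splitlines():
        heading = HEADING.match(line)
        if heading:
            level, key = len(heading.group(1)), heading.group(2)
            if level > len(stack):
                raise ValueError(f"heading has no valid parent: {line!r}")
            del stack[level:]
            parent = stack[-1]
            node = parent.setdefault(key, {})
            if isinstance(node, dict):
                stack.append(node)
            continue
        bullet = BULLET.match(line)
        if bullet and key is not None:
            if isinstance(parent[key], dict):
                if parent[key]:
                    raise ValueError(f"{key!r} mixes sub-keys and list items")
                # The first bullet turns the heading into a leaf array.
                parent[key] = []
                stack.pop()
            parent[key].append(bullet.group(1))
    return root
```

All values come back as strings, so any numeric or boolean fields would still need to be converted according to whatever schema is in use.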
This is a very interesting discovery! I personally have been finding Markdown to be my format of choice for most of my more complex LLM interactions, but this gives a nice concrete basis to justify that choice.
Somewhat tangentially related, you might find use in my “promptdown” package, which defines a specialized way of expressing prompt templates in a Markdown sub-format:
I’ve been using that format to externalize re-usable prompt templates, and I’m finding it a lot easier to read than JSON for this purpose.
Markdown could definitely fix the “broken JSON” issue, where double quotes still sometimes appear unescaped within a string. In JSON, this breaks any parser. In Markdown it wouldn’t matter at all.
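Here is a small illustration of that failure mode. The two replies are made-up examples, not real model output: the same list item carries unescaped double quotes, once inside a JSON string and once as a Markdown bullet.

```python
import json

# Hypothetical model reply: the quotes around "boss" were not escaped.
broken_json = '{"Relationships": ["Calls Authand "boss" in private"]}'

# The same content as a Markdown heading plus one list item.
markdown_reply = '#### Relationships\n- Calls Authand "boss" in private'


def is_valid_json(text: str) -> bool:
    """Return True if `text` parses as JSON, False otherwise."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


# Quotes are ordinary characters in a bullet line, so nothing breaks.
items = [
    line[2:] for line in markdown_reply.splitlines() if line.startswith("- ")
]

print(is_valid_json(broken_json))
print(items)
```

The JSON version fails to parse, while the bullet line is read back with the quotes intact.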