How do you make a bpe file for Tokenizer

raymonddavey · March 11, 2023, 10:44pm

Does anyone know how to convert a tiktoken file into a matching bpe (vocab) file?

I need to make a tokenizer in C# for gpt-3.5-turbo. I have figured out how to read the tiktoken file to create a dictionary.

I understand that we use byte pair encoding.

But I can't figure out how to build the bpe_ranks list. The GPT-2 and GPT-3 libraries used a vocab.bpe file.
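For readers with the same question: each line of a .tiktoken file is a base64-encoded token followed by its integer rank, and in tiktoken's encoder that rank doubles as the merge priority, so a separate merges file in the vocab.bpe style may not be needed. Below is a minimal sketch in Python (the reference library's language) that assumes this line format. The function name `load_tiktoken_ranks` and the three-line sample are illustrative, not taken from any library.

```python
import base64
from typing import Dict


def load_tiktoken_ranks(text: str) -> Dict[bytes, int]:
    """Parse the contents of a .tiktoken file into {token_bytes: rank}.

    Assumes each non-empty line is "<base64 token> <rank>".
    """
    ranks: Dict[bytes, int] = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        token_b64, rank = line.split()
        ranks[base64.b64decode(token_b64)] = int(rank)
    return ranks


# Hypothetical three-entry sample standing in for a real file's contents.
SAMPLE = "YQ== 0\nYg== 1\nYWI= 2\n"
ranks = load_tiktoken_ranks(SAMPLE)
print(ranks)  # {b'a': 0, b'b': 1, b'ab': 2}
```

The same dictionary can then serve both as the vocabulary (bytes to token id) and as the rank table consulted when deciding which adjacent pair to merge next.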

Here is a link to the GPT3.5 library, but it is written in Python AND optimized for speed (so very hard to understand)

github

And the only tokenizer for cl100k is

tiktoken

Even if someone has another way of encoding cl100k, I would be super happy

ruby_coder · March 12, 2023, 4:22am

Wouldn’t it be easier to call the Python tokenizer from C#?

This is how I did it from Ruby, and it works fine for me. I use it for many tasks, including (1) counting tokens in text and (2) creating logit_bias params.

Here is a random tutorial demonstrating how to call a Python script from C#. There are many other tutorials on the net on the topic:

HTH

Note, you can also call Python directly in a C# program using IronPython, FYI (but I have not tested it as it’s been a few years since I wrote C# code):

https://ironpython.net/

raymonddavey · March 12, 2023, 4:24am

Thanks for the info. I was thinking of doing that but couldn’t find any useful info to help.

ruby_coder · March 12, 2023, 4:25am

Welcome …

See my “late to the party” post edit on IronPython above. This is an approach you might like as well.

raymonddavey · March 12, 2023, 7:10pm

I figured it out in the end. I now have a tokenizer in native C# for the 100k and 50k tiktoken files.

The following page (and video) helped me understand what was needed, and then I wrote my own implementation.

The Rust and Python code was quite hard to follow and C# has Unicode UTF7 and UTF8 built-in. That made things a bit easier once I understood what was going on behind the scenes.
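For anyone attempting the same port, here is a minimal, unoptimized sketch of the byte-level merge loop. It is not the C# code described above and it is much simpler than tiktoken's implementation. It assumes the text has already been split into pieces (tiktoken does this with a regular expression first) and that every single byte of the piece is present in the rank table. `bpe_encode_piece` and `TOY_RANKS` are made-up names for illustration.

```python
from typing import Dict, List


def bpe_encode_piece(piece: bytes, ranks: Dict[bytes, int]) -> List[int]:
    """Encode one pre-split piece by repeatedly merging the adjacent
    pair whose concatenation has the lowest rank."""
    parts = [bytes([b]) for b in piece]  # start from single bytes
    while len(parts) > 1:
        best_idx, best_rank = None, None
        for i in range(len(parts) - 1):
            rank = ranks.get(parts[i] + parts[i + 1])
            if rank is not None and (best_rank is None or rank < best_rank):
                best_idx, best_rank = i, rank
        if best_idx is None:
            break  # no adjacent pair is in the vocabulary
        merged = parts[best_idx] + parts[best_idx + 1]
        parts[best_idx:best_idx + 2] = [merged]
    return [ranks[p] for p in parts]


# Toy rank table (hypothetical). A real one comes from a .tiktoken file.
TOY_RANKS = {
    b"a": 0, b"b": 1, b"c": 2, b"ab": 3, b"bc": 4, b"abc": 5,
    b"\xe4": 6, b"\xbd": 7, b"\xa0": 8,  # the three UTF-8 bytes of "你"
}

print(bpe_encode_piece(b"abc", TOY_RANKS))                # [5]
print(bpe_encode_piece(b"cab", TOY_RANKS))                # [2, 3]
print(bpe_encode_piece("你".encode("utf-8"), TOY_RANKS))  # [6, 7, 8]
```

The last call shows why UTF-8 handling matters: a single Chinese character is three bytes, and with no merges available for those bytes it becomes three tokens. Working on bytes rather than characters is what keeps non-ASCII text consistent with the reference tokenizer.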

I don’t have an active GitHub account at the moment, but if anyone wants the C# code, I’m happy to make it available. OpenAI are also welcome to include it on the “How to count tokens” page if they want to.

salvador1 · March 15, 2023, 1:51pm

Hi Raymond!

I would very much like to see your implementation in C#. I’m currently hacking away on a C# implementation for the cl100k_base file. I’m close and get mostly identical results to tiktoken, but not entirely.

Br

raymonddavey · March 16, 2023, 6:50pm

Sorry I’ve been off the radar for a few days. I’ll get you a link later today

salvador1 · March 16, 2023, 7:03pm

No stress! A few hours ago I finished converting the encoding algorithm used in the tiktoken library to C#. So I do currently have something functional (not pretty yet, but who cares).

I’d still love to see your approach!

raymonddavey · March 16, 2023, 7:06pm

I sent you a private message with my email address. Did you manage to get the Chinese examples to work?

salvador1 · March 16, 2023, 7:19pm

I have not seen any Chinese samples but will happily add unit tests to my code with Chinese! So far I’ve only tried a simple test string with some casing and punctuation, but nothing fancy.

raymonddavey · March 16, 2023, 7:31pm

There are 50k and 100k test examples at the bottom of this link

github.com

openai/openai-cookbook/blob/main/examples/How_to_count_tokens_with_tiktoken.ipynb

{
 "cells": [
  {
   "attachments": {},
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# How to count tokens with tiktoken\n",
    "\n",
    "[`tiktoken`](https://github.com/openai/tiktoken/blob/main/README.md) is a fast open-source tokenizer by OpenAI.\n",
    "\n",
    "Given a text string (e.g., `\"tiktoken is great!\"`) and an encoding (e.g., `\"cl100k_base\"`), a tokenizer can split the text string into a list of tokens (e.g., `[\"t\", \"ik\", \"token\", \" is\", \" great\", \"!\"]`).\n",
    "\n",
    "Splitting text strings into tokens is useful because GPT models see text in the form of tokens. Knowing how many tokens are in a text string can tell you (a) whether the string is too long for a text model to process and (b) how much an OpenAI API call costs (as usage is priced by token). Different models use different encodings.\n",
    "\n",
    "\n",
    "## Encodings\n",
    "\n",
    "Encodings specify how text is converted into tokens. Different models use different encodings.\n",
    "\n",


anthonyosx · March 23, 2023, 2:34am

Would you mind sharing your C# 100k tokenizer implementation with me as well? Been searching for one.

raymonddavey · March 23, 2023, 4:25am

The tokenizer can be downloaded from this page

There is a tokenizer for GPT-3.5 and GPT-4 (cl100k)
and a tokenizer for Davinci (p50k).

The cl100k tokenizer also applies to text-embedding-ada-002 for embeddings.

You will need to change the hardcoded path to the tiktoken files. This has been extracted from a larger MVC project.
