How do you make a BPE file for a tokenizer?

Does anyone know how to convert a tiktoken file into a matching BPE (vocab) file?

I need to make a tokenizer in C# for gpt-3.5-turbo. I have figured out how to read the tiktoken file to create a dictionary.
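
For reference, a minimal sketch of that reading step, written in Python for brevity (the logic should port directly to a C# dictionary keyed by byte sequence). It assumes each line of a .tiktoken file holds a base64-encoded token, a space, and an integer rank. The function name and the sample lines are made up for illustration; in particular, 999 is not the real rank of " the".

```python
import base64


def parse_tiktoken_lines(lines):
    """Build {token_bytes: rank} from lines shaped like '<base64 token> <rank>'."""
    ranks = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        token_b64, rank = line.split()
        ranks[base64.b64decode(token_b64)] = int(rank)
    return ranks


# Illustrative lines in the assumed format (not copied from a real file).
sample_lines = ["IQ== 0", "Ig== 1", "Iw== 2", "IHRoZQ== 999"]
ranks = parse_tiktoken_lines(sample_lines)
print(ranks)
```

Note that the keys are raw bytes, not strings: many tokens are not valid UTF-8 on their own, so decoding them to text at load time would lose information.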

I understand that we use byte pair encoding (BPE).

But I can’t figure out how to build the BPE ranks list. The GPT-2 and GPT-3 libraries used a vocab.bpe file.
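
As far as I can tell from tiktoken's reference code, no separate merges file is needed: the rank stored next to each token in the .tiktoken file doubles as the merge priority. The sketch below (Python, with a toy rank table invented for the example) starts from single bytes and repeatedly merges the adjacent pair whose concatenation has the lowest rank, until no adjacent pair is in the table. It is an unoptimized illustration under that assumption, and it leaves out the regex pre-splitting and special-token handling that a full tokenizer applies before this step.

```python
def bpe_encode(piece, ranks):
    """Greedy byte-pair merge of one pre-split piece of text (as bytes)."""
    parts = [bytes([b]) for b in piece]  # start from single bytes
    while len(parts) > 1:
        best_index, best_rank = None, None
        # Find the adjacent pair whose merged form has the lowest rank.
        for i in range(len(parts) - 1):
            rank = ranks.get(parts[i] + parts[i + 1])
            if rank is not None and (best_rank is None or rank < best_rank):
                best_index, best_rank = i, rank
        if best_index is None:
            break  # no adjacent pair is a known token
        merged = parts[best_index] + parts[best_index + 1]
        parts[best_index:best_index + 2] = [merged]
    # Each surviving part is a known token; its rank is its token ID.
    return [ranks[part] for part in parts]


# Toy rank table, invented for this example only.
toy_ranks = {b"a": 0, b"b": 1, b"c": 2, b"ab": 3, b"bc": 4, b"abc": 5}

print(bpe_encode(b"abc", toy_ranks))
print(bpe_encode(b"cab", toy_ranks))
```

In the toy table, "abc" collapses to a single token because "ab" (rank 3) merges first and "abc" (rank 5) is then available, while "cab" stops at two tokens because "cab" itself is not in the table.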

Here is a link to the GPT-3.5 library, but it is written in Python and optimized for speed, so it is very hard to understand:

github

And the only tokenizer for cl100k is

tiktoken

Even if someone just has another way of encoding with cl100k, I would be super happy.

Wouldn’t it be easier to call the Python tokenizer from C#?

This is how I did it from Ruby, and it works fine for me. I use it for many tasks, including (1) counting tokens in text and (2) creating logit_bias params.
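
To make the second use concrete, here is a rough sketch in Python of turning phrases into a logit_bias map. It assumes the API expects token IDs as keys and a bias between -100 and 100 as values. `build_logit_bias` and `fake_encode` are hypothetical names; `fake_encode` is only a stand-in so the example runs without a tokenizer installed, and a real cl100k encode function would be passed in its place.

```python
def build_logit_bias(phrases, encode, bias=-100):
    """Map every token ID that `encode` yields for the phrases to `bias`."""
    if not -100 <= bias <= 100:
        raise ValueError("bias must be between -100 and 100")
    logit_bias = {}
    for phrase in phrases:
        for token_id in encode(phrase):
            logit_bias[str(token_id)] = bias  # JSON object keys are strings
    return logit_bias


def fake_encode(text):
    """Stand-in encoder for the demo only: one fake 'token ID' per character."""
    return [ord(ch) for ch in text]


banned = build_logit_bias(["hi"], fake_encode)
print(banned)
```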

Here is a random tutorial demonstrating how to call a Python script from C#. There are many other tutorials on the net on the topic:

HTH

🙂

Note, you can also call Python directly in a C# program using IronPython, FYI (but I have not tested it as it’s been a few years since I wrote C# code):

https://ironpython.net/

Thanks for the info. I was thinking of doing that but couldn’t find any useful info to help.

Welcome …

See my “late to the party” post edit on IronPython above. This is an approach you might like as well.

🙂

I figured it out in the end. I now have a tokenizer in native C# for the 100k and 50k tiktoken files.

The following page (and video) helped me understand what was needed, and then I wrote my own implementation.

The Rust and Python code was quite hard to follow, and C# has UTF-7 and UTF-8 encoding support built in. That made things a bit easier once I understood what was going on behind the scenes.
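
One place where the UTF-8 handling matters is non-ASCII text such as Chinese. The tokenizer works on UTF-8 bytes, not characters, so a token boundary can fall inside a multi-byte character. The Python sketch below shows why the bytes of all tokens should be joined before decoding; in C#, `Encoding.UTF8.GetBytes` and `Encoding.UTF8.GetString` play the same roles. The split point used here is hypothetical and chosen only to show the effect, not taken from the real cl100k vocabulary.

```python
text = "你好"
data = text.encode("utf-8")
print(len(text), len(data))  # characters vs. UTF-8 bytes

# Hypothetical token boundaries that cut the first character in two.
token_bytes = [data[:2], data[2:]]

# Decoding a single token on its own can fail on a partial character.
try:
    token_bytes[0].decode("utf-8")
    decoded_alone = True
except UnicodeDecodeError:
    decoded_alone = False

# Joining the bytes of every token first gives the original text back.
round_trip = b"".join(token_bytes).decode("utf-8")
print(decoded_alone, round_trip == text)
```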

I don’t have an active GitHub account at the moment, but if anyone wants the C# code, I’m happy to make it available. OpenAI are also welcome to include it on the “How to count tokens” page if they want to.

Hi Raymond!

I would very much like to see your implementation in C#. I’m currently hacking away on a C# implementation for the cl100k_base file. I’m close and get mostly identical results to tiktoken, but not entirely.

Br

Sorry, I’ve been off the radar for a few days. I’ll get you a link later today.

No stress! A few hours ago I finished converting the encoding algorithm used in the tiktoken library to C#. So I do currently have something functional (not pretty yet, but who cares 🤷‍♂️).

I’d still love to see your approach!

I sent you a private message with my email address. Did you manage to get the Chinese examples to work?

I have not seen any Chinese samples but will happily add unit tests to my code with Chinese! So far I’ve only tried a simple test string with some casing and punctuation, but nothing fancy.

There are 50k and 100k test examples at the bottom of this link

Would you mind sharing your C# 100k tokenizer implementation with me as well? Been searching for one.

The tokenizer can be downloaded from this page

There is a tokenizer for GPT-3.5 and GPT-4 (cl100k)
and a tokenizer for Davinci (p50k).

The cl100k encoding also applies to text-embedding-ada-002 for embeddings.

You will need to change the hardcoded path to the tiktoken files. This has been extracted from a larger MVC project.
