Wouldn’t it be easier to call the Python tokenizer from C#?
This is how I did it with Ruby, and it works fine for me. I use it for many tasks, including (1) counting tokens in text and (2) creating logit_bias params.
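For reference, those two uses can be sketched in Python. Note that `encode` below is a hypothetical stand-in for whatever tokenizer call you shell out to (e.g. tiktoken's `Encoding.encode`); the toy vocabulary is for illustration only.

```python
# Sketch of the two uses above. `encode` is a hypothetical stand-in for
# whatever tokenizer you call out to; swap in the real one.

def count_tokens(encode, text: str) -> int:
    # (1) Count tokens in a piece of text.
    return len(encode(text))

def make_logit_bias(encode, words, bias=-100):
    # (2) Build a logit_bias param: map each token id of each word
    # to a bias value (-100 effectively bans the token).
    return {str(tok): bias for w in words for tok in encode(w)}

# Toy tokenizer for demonstration only: one "token" per whitespace-split word.
toy_vocab = {"hello": 1, "world": 2}
toy_encode = lambda s: [toy_vocab[w] for w in s.split()]

print(count_tokens(toy_encode, "hello world"))   # 2
print(make_logit_bias(toy_encode, ["world"]))    # {'2': -100}
```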
Here is a random tutorial demonstrating how to call a Python script from C#. There are many other tutorials on the net covering this topic:
HTH
Note: you can also call Python directly from a C# program using IronPython, FYI (though I have not tested it, as it's been a few years since I wrote C# code):
I figured it out in the end. I now have a tokenizer in native C# for the 100k and 50k tiktoken files.
The following page (and video) helped me understand what was needed, and then I wrote my own implementation.
The Rust and Python code was quite hard to follow, but C# has UTF-7 and UTF-8 support built in. That made things a bit easier once I understood what was going on behind the scenes.
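For anyone else trying to follow the Rust/Python source, the core of the encoding is a byte-pair merge loop, which can be sketched in Python like this. This is a simplified sketch, not my C# code; `ranks` stands in for the mergeable-ranks table you load from the .tiktoken file, and the toy table in the test is made up.

```python
def bpe_merge(piece: bytes, ranks: dict) -> list:
    """Split `piece` into tokens by repeatedly merging the adjacent
    pair with the lowest merge rank, until no mergeable pair remains."""
    # Start from individual bytes.
    parts = [piece[i:i + 1] for i in range(len(piece))]
    while len(parts) > 1:
        best_i, best_rank = None, None
        # Find the adjacent pair with the lowest rank in the table.
        for i in range(len(parts) - 1):
            r = ranks.get(parts[i] + parts[i + 1])
            if r is not None and (best_rank is None or r < best_rank):
                best_i, best_rank = i, r
        if best_i is None:
            break  # no pair can be merged any further
        # Merge the winning pair into a single token.
        parts[best_i:best_i + 2] = [parts[best_i] + parts[best_i + 1]]
    return parts

# Toy rank table for illustration: "ab" merges first, then "abc".
toy_ranks = {b"ab": 0, b"abc": 1}
print(bpe_merge(b"abc", toy_ranks))  # [b'abc']
```

In the real tokenizer the input is first split by a regex into pieces, each piece is UTF-8 encoded, and this merge runs per piece; the resulting byte sequences are then looked up in the rank table to get token ids.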
I don’t have an active GitHub account at the moment, but if anyone wants the C# code, I’m happy to make it available. OpenAI is also welcome to include it on the “How to count tokens” page if they want to.
I would very much like to see your implementation in C#. I’m currently hacking away on a C# implementation for the cl100k_base file myself. I’m close, and get mostly identical results to tiktoken, but not entirely.
No stress! Just a few hours ago I finished converting the encoding algorithm used in the tiktoken library to C#, so I do currently have something functional (not pretty yet, but who cares).
I have not seen any Chinese samples, but I will happily add unit tests with Chinese to my code! So far I’ve only tried a simple test string with some casing and punctuation, nothing fancy.
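Chinese is a good stress test precisely because the tokenizer operates on UTF-8 bytes: every CJK character is multi-byte, so it exercises code paths a plain ASCII test string never touches. A quick Python check of what the byte-level code has to handle (the sample string is my own, not from the thread):

```python
# CJK characters are 3 bytes each in UTF-8, so character count and
# byte count diverge -- the byte-pair merge must handle this correctly.
text = "你好, world"
data = text.encode("utf-8")
print(len(text), len(data))  # 9 characters, 13 bytes
```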