Does anyone know how to convert a tiktoken file into a matching BPE (vocab) file?
I need to make a tokenizer in C# for gpt-3.5-turbo. I have figured out how to read the tiktoken file to create a dictionary.
I understand that it uses byte-pair encoding (BPE).
But I can't figure out how to build the bpe_ranks list. The GPT-2 and GPT-3 libraries used a vocab.bpe file.
Here is a link to the GPT-3.5 library, but it is written in Python and optimized for speed (so very hard to understand).
And the only tokenizer for cl100k is
Even if someone has another way of encoding cl100k, I would be super happy.
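For reference, here is roughly how I'm loading the tiktoken file into the dictionary (a minimal sketch; the file name and the Base64 keying are just my choices):

```csharp
// Each line of a .tiktoken file is "<base64-encoded token bytes> <rank>".
// Keying the dictionary by the Base64 string avoids needing a custom
// byte[] equality comparer.
using System;
using System.Collections.Generic;
using System.IO;

var ranks = new Dictionary<string, int>();
foreach (var line in File.ReadLines("cl100k_base.tiktoken"))
{
    if (string.IsNullOrWhiteSpace(line)) continue;
    var parts = line.Split(' ');
    ranks[parts[0]] = int.Parse(parts[1]); // Base64 token -> rank (= token id)
}
```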
Wouldn’t it be easier to call the Python tokenizer from C#?
This is how I did it using Ruby, and it works fine for me; I use it for many tasks, including (1) counting tokens in text and (2) creating
For those of you who use Ruby, here is how I use tiktoken in my Ruby code, since there is no Ruby tiktoken gem for the turbo model (yet). Just a quick hack… You can make this more elegant if you wish, of course. I just tossed this salad together today for some testing:
Create a Python script like this (after installing tiktoken, of course):
```python
import sys, tiktoken

encoding = tiktoken.encoding_for_model("gpt-3.5-turbo")
tokens = encoding.encode(sys.argv[1])  # text passed in by the caller
print(len(tokens))
```
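You can then shell out to it from Ruby (or anything else) with something like `python count_tokens.py "some text"` and read the count from stdout (`count_tokens.py` here is just whatever name you save the script under).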
Here is a random tutorial demonstrating how to call a Python script from C#. There are many other tutorials on the net on this topic:
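If you go that route, here is a minimal sketch of shelling out to the script from C# (assuming the script above is saved as `count_tokens.py` and `python` is on your PATH):

```csharp
// Run the Python tokenizer script and read the token count it prints.
using System;
using System.Diagnostics;

var psi = new ProcessStartInfo
{
    FileName = "python",
    Arguments = "count_tokens.py \"tiktoken is great!\"",
    RedirectStandardOutput = true,
    UseShellExecute = false,
};
using var process = Process.Start(psi)!;
string output = process.StandardOutput.ReadToEnd();
process.WaitForExit();
Console.WriteLine($"Token count: {output.Trim()}");
```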
Note: you can also call Python directly in a C# program using IronPython, FYI (though I have not tested it, as it's been a few years since I wrote C# code):
Thanks for the info. I was thinking of doing that but couldn't find anything useful to help.
> Thanks for the info.
See my “late to the party” post edit on IronPython above. This is an approach you might like as well.
I figured it out in the end. I now have a tokenizer in native C# for the 100k and 50k tiktoken files.
The following page (and video) helped me understand what was needed, and then I wrote my own implementation.
The Rust and Python code was quite hard to follow, but C# has UTF-7 and UTF-8 handling built in. That made things a bit easier once I understood what was going on behind the scenes.
I don't have an active GitHub account at the moment, but if anyone wants the C# code, I'm happy to make it available. OpenAI is also welcome to include it on the “How to count tokens” page if they want to.
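For anyone attempting the same thing, the heart of it is a greedy merge loop over the ranks: split each piece of text into single bytes, then repeatedly merge the adjacent pair whose concatenated bytes have the lowest rank in the table, until no adjacent pair is in the table. Here is a minimal sketch of my rough (unoptimized) version, assuming the Base64-keyed `ranks` dictionary from earlier in the thread:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class Bpe
{
    // Greedy byte-pair merge: returns the token ids for one pre-split piece.
    public static List<int> BytePairEncode(byte[] piece, Dictionary<string, int> ranks)
    {
        // Start with one single-byte part per input byte.
        var parts = new List<byte[]>();
        foreach (var b in piece) parts.Add(new[] { b });

        while (parts.Count > 1)
        {
            // Find the adjacent pair whose merged bytes have the lowest rank.
            int bestRank = int.MaxValue, bestIndex = -1;
            for (int i = 0; i < parts.Count - 1; i++)
            {
                var merged = parts[i].Concat(parts[i + 1]).ToArray();
                if (ranks.TryGetValue(Convert.ToBase64String(merged), out int rank) && rank < bestRank)
                {
                    bestRank = rank;
                    bestIndex = i;
                }
            }
            if (bestIndex < 0) break; // nothing left to merge

            parts[bestIndex] = parts[bestIndex].Concat(parts[bestIndex + 1]).ToArray();
            parts.RemoveAt(bestIndex + 1);
        }

        // In the tiktoken files the rank doubles as the token id.
        return parts.Select(p => ranks[Convert.ToBase64String(p)]).ToList();
    }
}
```

Note that a full tokenizer also needs the regex pre-splitting step (and special-token handling) before this loop runs on each piece.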
I would very much like to see your implementation in C#. Personally, I'm currently hacking away at a C# implementation for the cl100k_base file. I'm close, and get mostly identical results to what tiktoken gets, but not entirely.
Sorry I’ve been off the radar for a few days. I’ll get you a link later today.
No stress! Not many hours ago I completed converting the encoding algorithm used in the tiktoken library to C#, so I do currently have something functional (not pretty yet, but who cares).
I’d still love to see your approach!
I sent you a private message with my email address. Did you manage to get the Chinese examples to work?
I have not seen any Chinese samples, but I will happily add unit tests with Chinese to my code! So far I've only tried a simple test string with some casing and punctuation, but nothing fancy.
There are 50k and 100k test examples at the bottom of this link
(truncated excerpt from the linked notebook)

> # How to count tokens with tiktoken
>
> [`tiktoken`](https://github.com/openai/tiktoken/blob/main/README.md) is a fast open-source tokenizer by OpenAI.
>
> Given a text string (e.g., `"tiktoken is great!"`) and an encoding (e.g., `"cl100k_base"`), a tokenizer can split the text string into a list of tokens (e.g., `["t", "ik", "token", " is", " great", "!"]`).
>
> Splitting text strings into tokens is useful because GPT models see text in the form of tokens. Knowing how many tokens are in a text string can tell you (a) whether the string is too long for a text model to process and (b) how much an OpenAI API call costs (as usage is priced by token).
>
> Encodings specify how text is converted into tokens. Different models use different encodings.