I was to write a simple implementation of bpe tokenization to understand the behavior of tiktoken
, only to find that my implementation turns out to be much faster than the tiktoken
implementation!
The code is available at GitHub - youkaichao/fast_bpe_tokenizer: fast bpe tokenizer, simple to understand, easy to use .
I test the correctness against the whole book of “War and Peace”. The Tokenization is correct (in the sense that tokens can be decoded back to the original string).
The core code has only dozens of lines!
1 Like
Generally C++ is faster than Python. But in my experience, the OpenAI libraries haven’t been optimized for speed. So no big surprise you can easily beat their Python (non-optimized) implementations.
This isn’t always a bad thing since non-optimized code is often easier to read and understand.
But good job nonetheless!
1 Like
Did you match the token strings? Often there are multiple ways to encode for varying length substrings, so if the encoding you generate differs at all from the tiktoken encoding, you may run into other problems downstream. tiktoken may be doing some lookahead to determine the ‘best’ encoding.
don’t know, never looked at the code.
Hi, fellows, thanks for the interest. I know now that I’m just doing prefix match, which is different from the encoding method from tiktoken.
I enjoy the reading from Byte Pair Encoding and Data Structures | Rust NLP tales , which clearly explains how the encoding works.
By the way, I think the reason why tiktoken is faster than huggingface is because tiktoken pre-splits the text into smaller pieces using pat_str
, which accelerates the computation, but does not strictly follow bpe encoding anymore.
1 Like
Ah, thanks for the info, wasn’t aware of that.
mostly I use tiktoken to get token counts. I don’t need exact counts, so I guess I’m ok
Closing remarks:
I created this implementation for the purpose of learning tiktoken
tokenizer. Later I find that my implementation is a greedy one, like the one proposed in paper Fast WordPiece Tokenization. The actual tokenization algorithm is illustrated at here with dynamic programming. I also consulted the author of tiktoken
, who is kind enough to update an educational implementation in tiktoken
with visualization. Thank the community for kindly sharing these knowledge!