My simple implementation is 10x faster than tiktoken. Anything wrong?

I set out to write a simple implementation of BPE tokenization to understand the behavior of tiktoken, only to find that my implementation turns out to be much faster than tiktoken itself!

The code is available on GitHub: youkaichao/fast_bpe_tokenizer (a fast BPE tokenizer, simple to understand and easy to use).

I tested the correctness against the whole text of “War and Peace”. The tokenization is correct in the sense that the tokens can be decoded back to the original string.
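For context, a round trip only proves the encoding is lossless; whether the token IDs actually match tiktoken's is a separate check. A minimal sketch using tiktoken's public API (the encoding name, file path, and `my_encode` function are placeholders, not taken from the repo):

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # assumed encoding; use whichever you benchmark against

with open("war_and_peace.txt", encoding="utf-8") as f:  # hypothetical path
    text = f.read()

reference_ids = enc.encode(text)

# Round-trip check: only shows the tokenization is lossless.
assert enc.decode(reference_ids) == text

# Stricter check: the token IDs themselves match tiktoken's output.
# my_ids = my_encode(text)  # placeholder for the simple implementation
# assert my_ids == reference_ids
```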

The core code is only a few dozen lines!


Generally C++ is faster than Python. But in my experience, the OpenAI libraries haven’t been optimized for speed, so it’s no big surprise that you can easily beat their non-optimized Python implementations.

This isn’t always a bad thing since non-optimized code is often easier to read and understand.

But good job nonetheless! :+1: :100:


Did you match the token strings? Often there are multiple ways to encode varying-length substrings, so if the encoding you generate differs at all from the tiktoken encoding, you may run into other problems downstream. tiktoken may be doing some lookahead to determine the ‘best’ encoding.
I don’t know, I’ve never looked at the code.

Hi folks, thanks for the interest. I now know that I’m just doing a greedy prefix match, which is different from tiktoken’s encoding method.
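For anyone following along, a greedy longest-prefix match looks roughly like the sketch below (a plain dict lookup stands in for a trie; this is an illustration of the idea, not the repo’s actual code):

```python
def greedy_prefix_encode(data: bytes, vocab: dict[bytes, int]) -> list[int]:
    """At each position, take the longest vocab entry that matches and move on.
    Decoding the result reproduces the original bytes, but the token sequence
    can differ from merge-order BPE, so the IDs need not match tiktoken's."""
    tokens = []
    i = 0
    while i < len(data):
        # Try the longest candidate first; a trie makes this fast,
        # the linear scan here is only for clarity.
        for j in range(len(data), i, -1):
            piece = data[i:j]
            if piece in vocab:
                tokens.append(vocab[piece])
                i = j
                break
        else:
            raise ValueError(f"no vocab entry covers byte {data[i]!r}")
    return tokens
```

Since every single byte is present in tiktoken’s rank table, the loop always makes progress; the divergence from tiktoken comes only from which longer pieces get picked.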

I enjoyed reading Byte Pair Encoding and Data Structures | Rust NLP tales, which clearly explains how the encoding works.

By the way, I think the reason tiktoken is faster than Hugging Face’s tokenizer is that tiktoken pre-splits the text into smaller pieces using pat_str, which speeds up the computation but no longer strictly follows BPE encoding over the whole string.
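As a rough illustration of that pre-splitting step (the pattern below is a simplified stand-in I made up, not tiktoken’s actual pat_str, which uses Unicode categories via the regex module):

```python
import re

# Simplified stand-in for tiktoken's pat_str: split into word-ish chunks,
# digit runs, punctuation runs, and whitespace. The real pattern is more
# elaborate (contractions, Unicode letter/number classes, etc.).
SIMPLE_PAT = re.compile(r" ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+")

def pre_split(text: str) -> list[str]:
    return SIMPLE_PAT.findall(text)

print(pre_split("Hello, world! It's 2024."))
# ['Hello', ',', ' world', '!', ' It', "'", 's', ' 2024', '.']
```

BPE is then run on each chunk independently, so no merge can cross a chunk boundary, which is what makes the result differ from “pure” BPE over the whole string while also being much cheaper to compute.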


Ah, thanks for the info, I wasn’t aware of that.
Mostly I use tiktoken to get token counts. I don’t need exact counts, so I guess I’m OK.

Closing remarks:

I created this implementation to learn about the tiktoken tokenizer. I later found that my implementation is a greedy one, like the one proposed in the paper Fast WordPiece Tokenization. The actual tokenization algorithm is illustrated here with dynamic programming. I also consulted the author of tiktoken, who was kind enough to update the educational implementation in tiktoken with a visualization. Thanks to the community for kindly sharing this knowledge!
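For completeness, the rank-based merge loop at the heart of BPE encoding looks roughly like this (a sketch in the spirit of the educational implementation, not tiktoken’s optimized Rust code; `ranks` maps token bytes to merge rank, and in tiktoken the rank doubles as the token ID for non-special tokens):

```python
def bpe_encode(chunk: bytes, ranks: dict[bytes, int]) -> list[int]:
    """Start from single bytes and repeatedly merge the adjacent pair whose
    merged form has the lowest rank, until no mergeable pair remains."""
    parts = [bytes([b]) for b in chunk]
    while len(parts) > 1:
        # Find the adjacent pair whose merged form has the lowest rank.
        best_rank, best_i = None, None
        for i in range(len(parts) - 1):
            rank = ranks.get(parts[i] + parts[i + 1])
            if rank is not None and (best_rank is None or rank < best_rank):
                best_rank, best_i = rank, i
        if best_i is None:
            break  # nothing left to merge
        # Merge the winning pair in place.
        parts[best_i:best_i + 2] = [parts[best_i] + parts[best_i + 1]]
    return [ranks[p] for p in parts]
```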

Just curious, have you compared your performance to this project? GitHub: VKCOM/YouTokenToMe (an unsupervised text tokenizer focused on computational efficiency)