Codex Tokenizer Logic

Could I get a rundown of how the Codex tokenizer's logic differs from GPT-3's? I saw it on here, but the only specifics I can find on how the tokenizer works are in an obfuscated code snippet that the browser loads on this page and the playground page. Do we plan on open-sourcing the tokenizer used on the playground? If we don’t, I am happy to take the underlying logic and apply it to the gpt-3-encoder package.



I decided to bite the bullet and build one for NodeJS: xanthous-tech/gpt3-tokenizer on GitHub (Isomorphic Tokenizer for GPT3 algorithm for OpenAI).

I have incorporated the best of what is out there and followed the minified code from the OpenAI tokenizer demo page. It currently supports NodeJS, but I should be able to make it available in the browser quickly (matching the token counting on the playground). It should also count tokens properly for Codex models, since it merges consecutive spaces into a single token.
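
To show why merging spaces changes the count, here is a toy sketch in Python. It is not the actual OpenAI or gpt3-tokenizer logic and it skips BPE entirely; the function names and regexes are my own assumptions, chosen only to illustrate the effect on indented code.

```python
import re

# Hypothetical, simplified splitters (assumption: not the real tokenizer,
# which runs byte-pair encoding on top of a more involved pattern).

def split_gpt3_style(text):
    # A word may carry one leading space; every other whitespace
    # character becomes its own piece.
    return re.findall(r" ?[^\s]+|\s", text)

def split_codex_style(text):
    # Same as above, except a run of consecutive spaces that is not
    # attached to a word is kept together as a single piece.
    return re.findall(r" ?[^\s]+| +|\s", text)

line = "    return x"  # four spaces of indentation
print(len(split_gpt3_style(line)), split_gpt3_style(line))
print(len(split_codex_style(line)), split_codex_style(line))
```

On this line the first splitter yields 5 pieces and the second yields 3, so the gap grows with the depth of indentation, which is why it matters for code.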