Is Tokenizer.from_pretrained("gpt2") the same tokenizer used in your GPT3 and ChatGPT models?

nbro · March 8, 2023, 2:04pm

My question is in the title.

Tokenizer.from_pretrained("gpt2") refers to the function from the tokenizers package: GitHub - huggingface/tokenizers: 💥 Fast State-of-the-Art Tokenizers optimized for Research and Production.

So, if I use this tokenizer to count the number of tokens that we pass to GPT3 (davinci, ada, etc) and ChatGPT (turbo) models, will this be accurate?

If not, should I use another tokenizer? Which one?

Do you provide a function that already does this in some package? I can’t find one in the openai package, but I think it should be part of the package: we really need to count the tokens in the prompt and the suffix, add max_tokens to that sum, and then check that the total is less than the GPT3 model’s context length. For ChatGPT models we need to do a similar thing. I found GitHub - openai/tiktoken, but it seems to be in development and it uses some pre-trained models which don’t seem to be open-source.
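The check described above can be sketched as follows. This is only an illustration, not an official OpenAI helper: the names `count_tokens` and `fits_context` are hypothetical, the tokenizer is passed in as a plain `encode` callable so the sketch does not assume which tokenizer is correct, and treating a total exactly equal to the context length as acceptable is an assumption.

```python
from typing import Callable, Sequence

# Any function that turns text into a sequence of token ids.
Encoder = Callable[[str], Sequence[int]]


def count_tokens(text: str, encode: Encoder) -> int:
    """Return how many tokens `encode` produces for `text`."""
    return len(encode(text))


def fits_context(
    prompt: str,
    suffix: str,
    max_tokens: int,
    context_length: int,
    encode: Encoder,
) -> bool:
    """Check prompt tokens + suffix tokens + max_tokens against the context length.

    Assumption: a total exactly equal to `context_length` is accepted.
    """
    used = count_tokens(prompt, encode) + count_tokens(suffix, encode)
    return used + max_tokens <= context_length
```

The result is only as accurate as the `encode` function passed in, which is exactly what this question is about. With tiktoken installed, `tiktoken.get_encoding("r50k_base").encode` would be one candidate to pass here.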

Please, don’t answer with guesses (like I tried this and that I think that’s that). I would prefer an official answer from an OpenAI employee.

AgusPG · March 8, 2023, 3:06pm

Short answer: no.
Long answer: link

nbro · March 8, 2023, 3:12pm

I’ve already read that notebook. As far as I understand, it uses certain non-open-source models for ChatGPT, but that doesn’t mean the models don’t use gpt2 under the hood. Hence my question and my wish to get an official answer from an OpenAI employee. If I can avoid using some non-open-source model, I’d prefer that.

AgusPG · March 8, 2023, 3:21pm

I don’t understand your statement. Of course it uses “non-open-source models” for ChatGPT: ChatGPT is not an open source model. But all the tokenizers have been open-sourced, starting with GPT2TokenizerFast (Transformers) and now tiktoken (which is supported by OpenAI themselves). You can literally read it in the link (which is an “official source”). r50k_base (or, equivalently, “gpt2”) is the tokenizer used by previous GPT-3 models, like davinci. cl100k_base is the new one, only accessible via tiktoken, that is used by ChatGPT models. They are indeed different: ChatGPT models do not tokenize using gpt2. Sorry for trying to help if you still prefer an “official source”. I will just stop doing it if you prefer, np
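A minimal sketch of how that mapping could be used to count tokens per model. The table contains only the two pairings stated in this post (davinci with r50k_base, ChatGPT with cl100k_base); using "gpt-3.5-turbo" as the ChatGPT model name and the helper names `encoding_name_for` and `count_tokens` are assumptions made for illustration. `tiktoken.get_encoding` is the library's own function and fetches the encoding data on first use.

```python
from typing import Dict

# Pairings taken from the post above; extend only with pairings you have verified.
MODEL_TO_ENCODING: Dict[str, str] = {
    "davinci": "r50k_base",          # earlier GPT-3 models
    "gpt-3.5-turbo": "cl100k_base",  # ChatGPT models (model name is an assumption)
}


def encoding_name_for(model: str) -> str:
    """Look up the tiktoken encoding name recorded for `model`."""
    try:
        return MODEL_TO_ENCODING[model]
    except KeyError:
        raise ValueError(f"no encoding recorded for model {model!r}") from None


def count_tokens(text: str, model: str) -> int:
    """Count tokens in `text` with the encoding recorded for `model`."""
    import tiktoken  # third-party: pip install tiktoken

    encoding = tiktoken.get_encoding(encoding_name_for(model))
    return len(encoding.encode(text))
```

tiktoken also ships `tiktoken.encoding_for_model(model_name)`, which performs this lookup itself, so the hand-written table is mainly useful for making the pairing explicit.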

nbro · March 8, 2023, 3:25pm

Thanks for trying to help. But like I said, from my perspective, if some package uses some model to initialise a tokenizer and associates it with ChatGPT, that doesn’t necessarily imply that the OpenAI API uses this tokenizer under the hood. If this were stated somewhere in the documentation, that would already be an answer for me, but I haven’t found it anywhere. Moreover, even if ChatGPT uses a different model (turbo or whatever), it could still use a gpt2 tokenizer.

georgei · March 8, 2023, 3:26pm

Give a try to the tokenizer tool from OpenAI.
At the bottom of the page you can find links to the library/package.

So far I’ve used the NodeJS package with davinci and it worked as expected.

I haven’t checked with turbo yet, but I don’t expect it to be different.

nbro · March 8, 2023, 3:28pm

@georgei Exactly. That page still points to the GPT2 tokenizer from Hugging Face’s package tokenizers: OpenAI GPT2

georgei · March 8, 2023, 3:30pm

Well, test for yourself then.
The GPT-2 tokenizer is valid for GPT-3 too.
Since ChatGPT is GPT-3 based, then we can only assume that the tokenizer is still valid.
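One way to test this rather than assume it is to run two tokenizers over the same sample texts and list every text where the token counts differ. The sketch below is a hypothetical helper (`count_mismatches` is not part of any library) that accepts any two encode callables, so it makes no claim about which tokenizer a given model actually uses.

```python
from typing import Callable, Iterable, List, Sequence, Tuple

# Any function that turns text into a sequence of tokens or token ids.
Encoder = Callable[[str], Sequence]


def count_mismatches(
    texts: Iterable[str],
    encode_a: Encoder,
    encode_b: Encoder,
) -> List[Tuple[str, int, int]]:
    """Return (text, count_a, count_b) for each text whose token counts differ."""
    mismatches = []
    for text in texts:
        count_a = len(encode_a(text))
        count_b = len(encode_b(text))
        if count_a != count_b:
            mismatches.append((text, count_a, count_b))
    return mismatches
```

For example, `encode_a` could wrap the Hugging Face tokenizer as `lambda s: Tokenizer.from_pretrained("gpt2").encode(s).ids` and `encode_b` could be `tiktoken.get_encoding("cl100k_base").encode`. An empty result on a small sample does not prove the tokenizers are identical, but any mismatch shows they are not.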

nbro · March 8, 2023, 3:32pm

@georgei The purpose of this question is: I don’t want to assume, I want facts. In fact, you’re assuming this, while another user assumed something else based on other info and some inference.

AgusPG · March 8, 2023, 3:35pm

It is not an assumption. It is very clearly stated in the link:

If this is not enough evidence, I give up. Have a nice day guys!

nbro · March 8, 2023, 3:38pm

I read that. That’s not a confirmation. It’s not saying “Under the hood, OpenAI uses this model X to initialise a tokenizer that is used in ChatGPT models”. It’s just a table that suggests that, but the author could also have decided to use those models for the ChatGPT tokenizers for some other reason. Yours is a reasonable assumption given that this repo belongs to the OpenAI organization on GitHub, but I’d like a confirmation from an official OpenAI employee. Thanks.
