So, if I use this tokenizer to count the number of tokens that we pass to the GPT-3 (davinci, ada, etc.) and ChatGPT (turbo) models, will the count be accurate?
If not, should I use another tokenizer? Which one?
Do you provide a function that already does this in some package? I can’t find it in the openai package, but I think it should be part of the package, because we really need to count the tokens in the prompt and suffix, add max_tokens, and then check that the total is less than the model’s context length. For ChatGPT models, we need to do something similar. I found GitHub - openai/tiktoken, but it seems to be under active development and it relies on some pre-trained models which don’t seem to be open-source.
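For what it’s worth, the check I have in mind looks roughly like this. This is just a sketch: the token counts would come from whichever tokenizer turns out to be correct, and the 2049-token context length for davinci is my assumption from the docs.

```python
# Sketch of the check I'd like the package to do for me. The token counts
# would come from whichever tokenizer is actually right for the model;
# the context length (2049 for davinci, as I understand the docs) is an
# assumption on my part.
def fits_in_context(prompt_tokens: int, suffix_tokens: int,
                    max_tokens: int, context_length: int) -> bool:
    """True if prompt + suffix + requested completion fit in the window."""
    return prompt_tokens + suffix_tokens + max_tokens <= context_length

# e.g. a 1500-token prompt and 100-token suffix with max_tokens=400
print(fits_in_context(1500, 100, 400, 2049))  # True: 2000 <= 2049
```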
Please, don’t answer with guesses (like “I tried this and I think that’s how it works”). I would prefer an official answer from an OpenAI employee.
I’ve already read that notebook. As far as I understand, it uses certain non-open-source models for ChatGPT, but that doesn’t mean that, under the hood, the models don’t use gpt2. Hence my question and my wish to get an official answer from an OpenAI employee. If I can avoid using a non-open-source model, I’d prefer that.
I don’t understand your statement. Of course it uses “non-open-source models” for ChatGPT: ChatGPT is not an open-source model. But all the tokenizers have been open-sourced, starting with GPT2TokenizerFast (Transformers) and now tiktoken (which is supported by OpenAI themselves). You can literally read it in the link (which is an “official source”): r50k_base (or, equivalently, “gpt2”) is the tokenizer used by previous GPT-3 models, like davinci, while cl100k_base is the new one, only accessible via tiktoken, which is used by the ChatGPT models. They are indeed different: ChatGPT models do not tokenize using gpt2. Sorry for trying to help if you still prefer an “official source”. I will just stop doing it if you prefer, np
Thanks for trying to help. But like I said, from my perspective, the fact that some package uses some model to initialise a tokenizer and associates it with ChatGPT doesn’t necessarily imply that, under the hood, the OpenAI API uses that tokenizer. If this were stated somewhere in the documentation, that would already be an answer for me, but I haven’t found it anywhere. Moreover, even if ChatGPT uses a different model (turbo or whatever), it could still use a gpt2 tokenizer.
I read that. That’s not a confirmation. It doesn’t say “under the hood, OpenAI uses model X to initialise a tokenizer that is used by the ChatGPT models”. It’s just a table that suggests that, but the author could also have decided to use those models for the ChatGPT tokenizers for some other reason. Yours is a reasonable assumption given that this repo belongs to the OpenAI organization on GitHub, but I’d like a confirmation from an OpenAI employee. Thanks.