For some background, I’m working on a Slack bot that uses the chat API to send prompts to a fine-tuned model, and I want to include as much prior message history as possible in each request, along with a system prompt. Based on my understanding of the model’s token limit (4097), and given that the model was fine-tuned on completions of no more than 1500 tokens, I figured I could cap the prior Slack message history at 4097 - 1500 = 2597 tokens.
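In other words, the budget I’m working with looks like this (the constant names are just mine, for illustration):

```javascript
// Budget arithmetic I'm assuming: the context window has to hold
// both the prompt (system prompt + history) and the completion.
const MODEL_CONTEXT_LIMIT = 4097; // gpt-3.5-turbo context length
const COMPLETION_RESERVE = 1500;  // largest completion seen in fine-tuning

// Tokens left over for the system prompt + Slack history.
const historyBudget = MODEL_CONTEXT_LIMIT - COMPLETION_RESERVE;

console.log(historyBudget); // 2597
```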
However, no matter what I do, my code seems to underestimate the number of tokens in the request.
For example, if I do the following:
import { encode } from "gpt-tokenizer/model/gpt-3.5-turbo";

encode("Foo bar baz."); // > [ 42023, 3703, 51347, 13 ]
Unless I’m reading this wrong, that’s 4 tokens.
But pasting the identical text into the tokenizer on the OpenAI Platform shows 6 tokens, not 4. I’m guessing that’s because the page uses p50k rather than cl100k (supposedly the encoding used by gpt-3.5 and gpt-4).
And yeah, with my bot code I almost always get an error like the following from the chat API:
This model's maximum context length is 4097 tokens. However, you requested 4189 tokens (666 in the messages, 3523 in the completion). Please reduce the length of the messages or completion.
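Sanity-checking the numbers in that error (just the arithmetic; the variable names are mine):

```javascript
// The API counts the request as message tokens plus the requested
// completion (max_tokens), and compares the sum to the context limit.
const contextLimit = 4097;
const messageTokens = 666;     // "666 in the messages"
const completionTokens = 3523; // "3523 in the completion"

const requested = messageTokens + completionTokens;
console.log(requested);                // 4189, matching the error
console.log(requested - contextLimit); // 92 tokens over the limit
```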
My token counts are consistently lower than the API’s, but not by much. This stinks, because I know I could probably switch to p50k or just subtract some arbitrary safety margin so that my message history almost always fits the context limit, but I’d prefer greater precision than that.
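For reference, the trimming I’m doing is roughly this (a sketch, not my actual code — `countTokens` is a stand-in for whatever tokenizer turns out to be accurate, and the whitespace-split counter below is just a placeholder for demonstration):

```javascript
// Drop the oldest Slack messages until the remaining history fits the budget.
// countTokens is a placeholder; in real code it would wrap a tokenizer's encode().
function trimHistory(messages, budget, countTokens) {
  const kept = [...messages];
  const total = (msgs) =>
    msgs.reduce((sum, m) => sum + countTokens(m.content), 0);
  while (kept.length > 0 && total(kept) > budget) {
    kept.shift(); // remove the oldest message first
  }
  return kept;
}

// Placeholder counter: roughly one token per whitespace-separated word.
const naiveCount = (text) => text.split(/\s+/).filter(Boolean).length;

const history = [
  { role: "user", content: "first message here" },
  { role: "assistant", content: "a reply" },
  { role: "user", content: "latest question" },
];
console.log(trimHistory(history, 4, naiveCount).length); // 2 (oldest dropped)
```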
Does anyone know what I’m doing wrong, or have suggestions for a better library to use? I’m confused because I haven’t come across any other nuance to be aware of when it comes to counting tokens.