Here's a small update:
So far, the gap holds at 11 tokens.
I've tried several different prompts, and each time there are 11 extra tokens that I can't trace back to anything.
Nope, there is no “boundary marker” or anything like the other theories.
The playground uses the wrong tokenizer for the model. Consider tiktoken's model-to-encoding map:
"text-davinci-003": "p50k_base",
"text-davinci-002": "p50k_base",
"text-davinci-001": "r50k_base",
Now paste a large block of varied text into the playground and switch between text-davinci-001 and -003. You will see that the token count doesn't change, even though those models use different encodings.
Bad count using wrong BPE:
Correct count:
Change to text-davinci-001 in tiktokenizer and get the playground’s mistaken token count.
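You can reproduce the mismatch locally. Here is a minimal sketch with Python's tiktoken (the sample text is just a placeholder; use your own prompt):

import tiktoken

# p50k_base is what text-davinci-002/-003 actually use;
# r50k_base is the older encoding used by text-davinci-001.
text = "Paste a long, varied sample of text here."
p50k = tiktoken.get_encoding("p50k_base")
r50k = tiktoken.get_encoding("r50k_base")

print("p50k_base (davinci-003):", len(p50k.encode(text)))
print("r50k_base (davinci-001):", len(r50k.encode(text)))
# If the playground's count matches the r50k_base number while
# text-davinci-003 is selected, it is counting with the wrong BPE.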
Worry not about that which you cannot control: others' code. Record your own token usage from what the API returns and compare it to billing:
--response--
{
"choices": [
{
"finish_reason": "stop",
"index": 0,
"logprobs": null,
"text": ""
}
],
"created": 1688568966,
"id": "cmpl-xxx",
"model": "text-babbage-001",
"object": "text_completion",
"usage": {
"prompt_tokens": 1293,
"total_tokens": 1293
}
}
---
--response--
{
"choices": [
{
"finish_reason": "stop",
"index": 0,
"logprobs": null,
"text": "\n\nThis code provides a function that returns the encoding used by a given model name. It uses a dictionary to map model names to their corresponding encoding, and also checks for model names that match a known prefix. If the model name is not found, an error is raised."
}
],
"created": 1688568986,
"id": "cmpl-xxx",
"model": "text-davinci-003",
"object": "text_completion",
"usage": {
"completion_tokens": 56,
"prompt_tokens": 1107,
"total_tokens": 1163
}
}
---
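If you want to automate that comparison, here is a rough sketch (Python with tiktoken; response is assumed to be the parsed JSON of a completion response like the ones above, and prompt the exact string you sent):

import tiktoken

def check_prompt_count(response: dict, prompt: str) -> None:
    # Look up the encoding tiktoken maps to this model name.
    enc = tiktoken.encoding_for_model(response["model"])
    local = len(enc.encode(prompt))
    reported = response["usage"]["prompt_tokens"]
    print(f"local={local} reported={reported} diff={reported - local}")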
Hello, I am getting the wrong count using gpt-3.5-turbo-0613
as the model.
I'm using com.theokanning.openai-gpt3-java
to consume the API and com.didalgo:gpt3-tokenizer
to count tokens locally.
So should I use davinci-001 instead of gpt-3.5-turbo? Are there any consequences?
Right now I am just adding defensive code to avoid requesting more tokens than the model accepts, and because of that, more defensive code to avoid requesting negative max_tokens values.
Thanks for the guidance.
GitHub - didalgolab/gpt3-tokenizer-java: Java implementation of a GPT3/4 tokenizer, “loosely ported from tiktoken”.
It was updated five days ago and appears to support the chat models' 100k tokenizer:
GPT3Tokenizer tokenizer = new GPT3Tokenizer(Encoding.CL100K_BASE);
List<Integer> tokens = tokenizer.encode("example text here");
where you must use the correct encoding for the model selected. You can verify its token counts against quality tools and against actual API responses. And a reminder: a client-side tokenizer has to ship a large BPE dictionary file.
text-davinci-001 is a completion engine and is not strongly trained on instruction following. It requires a different prompting style to get useful output - and it costs much more. So there are few cases where davinci makes sense - for example, if you don't want to be blasted with disclaimers and warnings.
You probably want even smarter max_tokens code, or conversation management. Think of max_tokens as a reservation for the output you require: you don't want just 5 tokens remaining for the answer when you've dumped 4000+ input tokens.
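A minimal sketch of that reservation logic (the numbers are assumptions; set them for your model and use case):

def budget_max_tokens(prompt_tokens: int,
                      context_limit: int = 4097,  # e.g. text-davinci-003
                      min_reply: int = 256) -> int:
    # Reserve at least min_reply tokens for the completion; if the prompt
    # leaves less than that, trim the prompt rather than sending a tiny
    # (or negative) max_tokens value.
    remaining = context_limit - prompt_tokens
    if remaining < min_reply:
        raise ValueError(f"Prompt uses {prompt_tokens} tokens; leave at least {min_reply} for the reply.")
    return remaining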