According to the OpenAI API docs, the token count tends to be overestimated for languages other than English. In fact, I don't think GPT-3 actually tokenizes this way, so can I just rely on this pricing?
If I were you, I would experiment with different completions and then check how many tokens were used according to your account's usage stats. Then compare those numbers to your original prompts and completions to build your own estimate of the token count and cost.
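That comparison is easy to script. A minimal sketch, where the per-1K-token price is a placeholder and not an official figure (check the pricing page for your model's actual rate):

```python
# Rough cost estimator for comparing your own counts against account stats.
# PRICE_PER_1K_TOKENS is a hypothetical value, not an official price.
PRICE_PER_1K_TOKENS = 0.02  # assumed USD per 1,000 tokens

def estimate_cost(prompt_tokens: int, completion_tokens: int) -> float:
    """Estimate USD cost for one request; both prompt and completion bill."""
    total_tokens = prompt_tokens + completion_tokens
    return total_tokens / 1000 * PRICE_PER_1K_TOKENS

# Example: a 2350-token prompt with a 100-token completion.
print(round(estimate_cost(2350, 100), 4))
```

Run that against a few real requests and compare the totals with what your usage page reports.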
Are you talking about completions, fine-tuning, or embeddings?
HTH
Thank you.
I am still in the process of comparing with models elsewhere. The billing format is important, but the output quality is also important, so I checked that first before training on a large amount of data. Thank you for the helpful reply.
In our tests, although it charged a bit more, it produced results similar to the token calculator's.
Doesn't GPT-3 actually use a separate tokenizer for languages other than English?
1,200 characters becoming 2,350 tokens is too much of a burden.
No, I believe the current models are tokenized for English only… (at this time?). I don't think they recommend them for non-English queries, at least in the initial beta rollout…
I believe it's in the docs somewhere, but I don't have the link handy at the moment. Hope this helps.
Thank you for your advice.
When using the prompt in the playground, I think the performance is worth considering, but the tokenization cost is a burden, so I plan to use my free credits for it. Thanks again for the advice.
Stumbled upon this while trying to answer my own question ("what's the average number of characters per token in languages other than English?"). I'm still looking, but I do have data showing that it varies widely by language. Compare:
Sentence 1:

- The early bird catches the worm. (English)
  => 7 tokens, 32 characters => 4.57142857 chars/token
- Кто рано встает, тому Бог подает. (Russian)
  => 34 tokens, 33 characters => 0.97058824 chars/token – that is 4.8x the number of tokens compared to English!
- A korán kelő madár elkapja a férget. (Hungarian)
  => 21 tokens, 37 characters => 1.76190476 chars/token
Sentence 2:

- Hello, world! (English)
  => 4 tokens, 13 characters => 3.25 chars/token
- Привет, мир! (Russian)
  => 13 tokens, 12 characters => 0.93 chars/token
- Helló, világ! (Hungarian)
  => 7 tokens, 13 characters => 1.85 chars/token
This means a HUGE difference in costs if you're not careful.
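The cost gap follows directly from the token counts above, since billing scales with tokens. A quick calculation using the measurements for sentence 1:

```python
# Token/character counts measured above for the same proverb in three languages.
samples = {
    "English":   {"tokens": 7,  "chars": 32},
    "Russian":   {"tokens": 34, "chars": 33},
    "Hungarian": {"tokens": 21, "chars": 37},
}

baseline = samples["English"]["tokens"]
for lang, s in samples.items():
    chars_per_token = s["chars"] / s["tokens"]
    cost_multiplier = s["tokens"] / baseline  # cost scales with token count
    print(f"{lang}: {chars_per_token:.2f} chars/token, "
          f"{cost_multiplier:.1f}x the English cost")
```

The Russian version of the same sentence bills nearly five times as much as the English one.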
Got the values from the tokenizer and running queries on the Completion endpoint confirms them.
You can also get the token count by calling the completion API method.
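The Completion endpoint's JSON response includes a `usage` object with the billed counts. A minimal sketch of reading it (the `sample` dict below is an illustrative response shape, not a real API call):

```python
def token_usage(response: dict) -> tuple:
    """Pull the billed token counts out of a completion response."""
    usage = response.get("usage", {})
    return (usage.get("prompt_tokens", 0),
            usage.get("completion_tokens", 0),
            usage.get("total_tokens", 0))

# Illustrative response fragment with the documented `usage` fields.
sample = {"usage": {"prompt_tokens": 34, "completion_tokens": 16,
                    "total_tokens": 50}}
print(token_usage(sample))  # -> (34, 16, 50)
```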