Stumbled upon this trying to answer my own question (“what’s the average number of characters per token in languages other than English?”). I’m still looking, but I do have data showing that it varies widely depending on the language. Just compare:
Sentence 1:

- The early bird catches the worm. (English)
  => 7 tokens, 32 characters => 4.57142857 chars/token
- Кто рано встает, тому Бог подает. (Russian; roughly “He who rises early, God provides for”)
  => 34 tokens, 33 characters => 0.97058824 chars/token; that is roughly 4.9x the number of tokens compared to English!
- A korán kelő madár elkapja a férget. (Hungarian; “The early-rising bird catches the worm”)
  => 21 tokens, 37 characters => 1.76190476 chars/token
Sentence 2:

- Hello, world! (English)
  => 4 tokens, 13 characters => 3.25 chars/token
- Привет, мир! (Russian; “Hello, world!”)
  => 13 tokens, 12 characters => 0.92 chars/token
- Helló, világ! (Hungarian; “Hello, world!”)
  => 7 tokens, 13 characters => 1.86 chars/token
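If you want to reproduce these numbers yourself, here is a minimal sketch using the tiktoken package. I’m assuming the GPT-3-era "gpt2" (r50k_base) encoding, which I believe is what the web tokenizer used at the time; newer models use different encodings (e.g. "cl100k_base"), so your exact counts may differ.

```python
# Minimal sketch: count tokens and characters per sample string.
# Assumes the GPT-3-era "gpt2"/r50k_base encoding via tiktoken.
import tiktoken

enc = tiktoken.get_encoding("gpt2")

samples = [
    "The early bird catches the worm.",
    "Кто рано встает, тому Бог подает.",
    "A korán kelő madár elkapja a férget.",
    "Hello, world!",
    "Привет, мир!",
    "Helló, világ!",
]

for text in samples:
    tokens = enc.encode(text)
    print(f"{len(tokens):3d} tokens, {len(text):3d} chars, "
          f"{len(text) / len(tokens):.2f} chars/token | {text}")
```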
This means a HUGE difference in costs if you’re not careful.
I got the counts from the tokenizer, and running queries against the Completions endpoint confirms them.
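For reference, this is the kind of cross-check I mean, sketched with the current openai Python client; the model name below is only a placeholder, not necessarily what I originally ran. The usage.prompt_tokens field in the response tells you how many tokens the prompt actually consumed, which is what drives the cost difference.

```python
# Hedged sketch of the API cross-check using the openai Python client.
# The model name is an example; any completions-capable model will do.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

resp = client.completions.create(
    model="gpt-3.5-turbo-instruct",  # placeholder model
    prompt="Привет, мир!",
    max_tokens=1,                    # we only care about the prompt side
)

# How many tokens the prompt consumed, as billed by the API.
print(resp.usage.prompt_tokens)
```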