How does GPT-3 cost calculation work for languages other than English?

Stumbled upon this while trying to answer my own question (“what’s the average number of characters per token in languages other than English?”). I’m still looking, but I do have data showing that it varies widely depending on the language. Just compare:

Sentence 1:

  • The early bird catches the worm. => 7 tokens, 32 characters => 4.57142857 chars/token
  • Кто рано встает, тому Бог подает. => 34 tokens, 33 characters => 0.97058824 chars/token – that is about 4.9x the number of tokens compared to English!
  • A korán kelő madár elkapja a férget. => 21 tokens, 37 characters => 1.76190476 chars/token

Sentence 2:

  • Hello, world! => 4 tokens, 13 characters => 3.25 chars/token
  • Привет, мир! => 13 tokens, 12 characters => 0.93 chars/token
  • Helló, világ! => 7 tokens, 13 characters => 1.85 chars/token
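
If you want to reproduce these ratios yourself, here’s a minimal sketch using the tiktoken library. I’m assuming the r50k_base encoding (the one used by the original GPT-3 models); newer encodings like cl100k_base will give different counts.

```python
import tiktoken

# r50k_base is the encoding used by the original GPT-3 models (e.g. davinci).
# Newer models use other encodings, so exact token counts will differ.
enc = tiktoken.get_encoding("r50k_base")

sentences = [
    "The early bird catches the worm.",
    "Кто рано встает, тому Бог подает.",
    "A korán kelő madár elkapja a férget.",
    "Hello, world!",
    "Привет, мир!",
    "Helló, világ!",
]

for s in sentences:
    tokens = enc.encode(s)
    print(f"{s!r}: {len(tokens)} tokens, {len(s)} characters, "
          f"{len(s) / len(tokens):.2f} chars/token")
```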

This means a HUGE difference in costs if you’re not careful.
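
To put a rough number on it: pricing is per token, so with these ratios the same sentence costs roughly 5x more in Russian than in English. A small sketch (the per-1K-token price below is just an assumption for illustration; check the current pricing page):

```python
# Hypothetical price for illustration only -- check the actual pricing page.
PRICE_PER_1K_TOKENS = 0.02  # USD per 1,000 tokens (assumed)

def prompt_cost(token_count: int, price_per_1k: float = PRICE_PER_1K_TOKENS) -> float:
    """Cost of a prompt, given its token count and a per-1K-token price."""
    return token_count / 1000 * price_per_1k

# Same sentence, different languages (token counts from the examples above):
print(prompt_cost(7))   # English: 0.00014 USD
print(prompt_cost(34))  # Russian: 0.00068 USD -- about 4.9x the English cost
```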

I got the values from the tokenizer, and running queries against the Completions endpoint confirms them.
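
For example (a sketch assuming the legacy pre-1.0 openai Python SDK and a Completions-capable model such as text-davinci-003), the usage block in the response reports how many tokens the prompt consumed:

```python
import openai  # legacy (pre-1.0) SDK

openai.api_key = "sk-..."  # your API key

resp = openai.Completion.create(
    model="text-davinci-003",
    prompt="Привет, мир!",
    max_tokens=1,  # keep the completion tiny; we only care about the prompt
)

# The usage block reports the prompt's actual token count.
print(resp["usage"]["prompt_tokens"])
```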
