Stumbled upon this trying to answer my own question (“what’s the average number of characters per token in languages other than English?”). I’m still looking, but I do have data showing that it varies widely depending on the language. Just compare:
Sentence 1:

- The early bird catches the worm. (English)
  => 7 tokens, 32 characters => 4.57142857 chars/token
- Кто рано встает, тому Бог подает. (Russian; roughly “He who rises early, God provides for”)
  => 34 tokens, 33 characters => 0.97058824 chars/token; that is roughly 4.9x the number of tokens compared to English!
- A korán kelő madár elkapja a férget. (Hungarian; “The early-rising bird catches the worm”)
  => 21 tokens, 37 characters => 1.76190476 chars/token
Sentence 2:

- Hello, world! (English)
  => 4 tokens, 13 characters => 3.25 chars/token
- Привет, мир! (Russian; “Hello, world!”)
  => 13 tokens, 12 characters => 0.92 chars/token
- Helló, világ! (Hungarian; “Hello, world!”)
  => 7 tokens, 13 characters => 1.86 chars/token
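If you want to reproduce these numbers yourself, here is a minimal sketch using the tiktoken package. I’m assuming the GPT-3-era "gpt2" (r50k_base) encoding, which I believe is what the web tokenizer used at the time; newer models use different encodings (e.g. "cl100k_base"), so your exact counts may differ.

```python
# Minimal sketch: count tokens and characters per sample string.
# Assumes the GPT-3-era "gpt2"/r50k_base encoding via tiktoken.
import tiktoken

enc = tiktoken.get_encoding("gpt2")

samples = [
    "The early bird catches the worm.",
    "Кто рано встает, тому Бог подает.",
    "A korán kelő madár elkapja a férget.",
    "Hello, world!",
    "Привет, мир!",
    "Helló, világ!",
]

for text in samples:
    tokens = enc.encode(text)
    print(f"{len(tokens):3d} tokens, {len(text):3d} chars, "
          f"{len(text) / len(tokens):.2f} chars/token | {text}")
```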
This means a HUGE difference in costs if you’re not careful.
I got the counts from the tokenizer, and running queries against the Completions endpoint confirms them.
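For reference, this is the kind of cross-check I mean, sketched with the current openai Python client; the model name below is only a placeholder, not necessarily what I originally ran. The usage.prompt_tokens field in the response tells you how many tokens the prompt actually consumed, which is what drives the cost difference.

```python
# Hedged sketch of the API cross-check using the openai Python client.
# The model name is an example; any completions-capable model will do.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

resp = client.completions.create(
    model="gpt-3.5-turbo-instruct",  # placeholder model
    prompt="Привет, мир!",
    max_tokens=1,                    # we only care about the prompt side
)

# How many tokens the prompt consumed, as billed by the API.
print(resp.usage.prompt_tokens)
```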