How does GPT-3 calculate cost for languages other than English?

rluvu2 · February 15, 2023, 4:47am

According to the OpenAI API, the number of tokens tends to be overestimated for languages other than English. I don’t think GPT-3 actually tokenizes this way. Can I rely on this count for pricing?

ruby_coder · February 15, 2023, 6:01am

If I were you, I would experiment with different completions and then check your account usage stats to see how many tokens were actually used. Then you can compare those numbers against your original prompts and completions and build your own estimate of the token count and the cost.
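The comparison suggested here can be sketched roughly as follows. This is only a sketch under stated assumptions, not OpenAI's billing logic: the sample measurements are made-up placeholders for numbers you would read off your own usage page, the price per 1,000 tokens is a placeholder to be replaced with the current price for your model, and `tokens_per_char` and `estimate_cost` are hypothetical helper names.

```python
def tokens_per_char(samples):
    """Average tokens per character over (characters, billed_tokens) pairs."""
    total_chars = sum(chars for chars, _ in samples)
    total_tokens = sum(tokens for _, tokens in samples)
    return total_tokens / total_chars


def estimate_cost(num_chars, ratio, price_per_1k_tokens):
    """Estimate the cost of processing num_chars characters of similar text."""
    estimated_tokens = num_chars * ratio
    return estimated_tokens / 1000 * price_per_1k_tokens


# Made-up placeholder measurements: (characters sent, tokens billed) per test request.
samples = [(1000, 1900), (500, 1020), (250, 480)]
PRICE_PER_1K = 0.02  # placeholder USD price; check the current pricing page

ratio = tokens_per_char(samples)
cost = estimate_cost(100_000, ratio, PRICE_PER_1K)
print(f"{ratio:.2f} tokens per character")
print(f"Estimated cost for 100,000 characters: ${cost:.2f}")
```

The estimate only holds for text in the same language and style as the samples, since the ratio varies a lot between languages.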

Are you talking about completions, fine-tunings or embeddings?

HTH

rluvu2 · February 15, 2023, 6:07am

Thank you.

I am still in the process of comparing with models elsewhere. The billing format is important, but the output result is also important, so I checked it first before training quite a lot of data. Thank you for the good reply.

rluvu2 · February 15, 2023, 6:37am

In our tests, although it charged a bit more, it produced similar results to the token calculator.
Doesn’t the actual GPT-3 have a separate tokenizer for languages other than English?
2350 tokens for 1200 characters is too much of a burden.
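A token count higher than the character count is plausible because the GPT-2/GPT-3 tokenizer is a byte-level BPE: it works on UTF-8 bytes, not characters, and non-Latin scripts need two or three bytes per character. The sketch below does not run the real tokenizer. It only compares character and byte counts, where the byte count is an upper bound on the token count for plain text, since every token covers at least one byte. The sample strings are illustrative and are not the text from the test above.

```python
def char_and_byte_counts(text):
    """Return (characters, UTF-8 bytes) for a string."""
    return len(text), len(text.encode("utf-8"))


# Illustrative strings only; substitute your own prompts.
samples = {
    "English": "Hello, world!",
    "Russian": "Привет, мир!",
    "Korean": "안녕하세요, 세계!",
}

for language, text in samples.items():
    chars, utf8_bytes = char_and_byte_counts(text)
    print(f"{language}: {chars} characters, {utf8_bytes} UTF-8 bytes")
```

A ratio of about two tokens per character would be consistent with a script that takes two or three bytes per character and for which the vocabulary has few merged tokens.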

PaulBellow · February 15, 2023, 6:52am

No, I believe the current models are tokenized in English only… (at this time?) I don’t think they recommend it for non-English queries, at least in the initial beta rollout…

I believe it’s in the docs somewhere, but I don’t have the link handy at the moment. Hope this helps.

rluvu2 · February 15, 2023, 6:57am

Thank you for your advice.

When using the prompt in the playground, I think the performance is worth considering, but the tokenizing cost is a burden, so I plan to use it only with the free credits. Thanks again for the advice.

fabien.snauwaert · February 20, 2023, 2:22pm

Stumbled upon this trying to answer my own question (“what’s the average number of characters per token in languages other than English?”). I’m still looking, but I do have data to show that it varies widely depending on the language. Just compare:

Sentence 1:

The early bird catches the worm. => 7 tokens, 32 characters => 4.57 chars/token
Кто рано встает, тому Бог подает. => 34 tokens, 33 characters => 0.97 chars/token (that is about 4.9x the number of tokens compared to English!)
A korán kelő madár elkapja a férget. => 21 tokens, 36 characters => 1.71 chars/token

Sentence 2:

Hello, world! => 4 tokens, 13 characters => 3.25 chars/token
Привет, мир! => 13 tokens, 12 characters => 0.92 chars/token
Helló, világ! => 7 tokens, 13 characters => 1.86 chars/token

This means a HUGE difference in costs if you are not careful.

I got the values from the tokenizer, and running queries on the Completion endpoint confirms them.
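Since billing is per token, the cost of the same sentence scales with its token count. A quick sketch of that arithmetic, using only the token counts reported above for Sentence 1; it assumes the same per-token price for every language and ignores completion tokens, and `relative_cost` is a hypothetical helper name.

```python
# Token counts as reported in this post for "The early bird catches the worm."
# and its Russian and Hungarian counterparts.
reported_tokens = {"English": 7, "Russian": 34, "Hungarian": 21}


def relative_cost(tokens, baseline="English"):
    """Token count of each language divided by the baseline's token count."""
    base = tokens[baseline]
    return {language: count / base for language, count in tokens.items()}


for language, factor in relative_cost(reported_tokens).items():
    print(f"{language}: {factor:.2f}x the English token count")
```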

ruby_coder · February 20, 2023, 2:41pm

You can also get the token count by calling the completion API method.
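The completion response carries a `usage` object with `prompt_tokens`, `completion_tokens` and `total_tokens`. A minimal sketch of reading it, assuming the response has already been parsed into a dict; the example response is hand-written with made-up numbers rather than captured from the API, and `token_usage` is a hypothetical helper name.

```python
def token_usage(response):
    """Extract (prompt, completion, total) token counts from a parsed response."""
    usage = response["usage"]
    # completion_tokens may be absent when nothing was generated, so default to 0.
    return (
        usage["prompt_tokens"],
        usage.get("completion_tokens", 0),
        usage["total_tokens"],
    )


# Hand-written example in the shape of a completion response; numbers are made up.
example_response = {
    "choices": [{"text": "..."}],
    "usage": {"prompt_tokens": 13, "completion_tokens": 20, "total_tokens": 33},
}

prompt, completion, total = token_usage(example_response)
print(f"prompt={prompt} completion={completion} total={total}")
```

Logging these values per request gives exact counts for your own language and prompts, which is more reliable than a characters-per-token rule of thumb.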

