How many words in Cyrillic can you get from a million tokens?

Hello! Could you advise where or how I can better estimate the approximate cost of “pure words” in Bulgarian? I know that 1,000,000 tokens cost $1.50. Let’s say that is about 750,000 English words. But some of that budget is lost when translating from English into Bulgarian. How much is lost? Approximately how many Bulgarian words will I end up with? I’m at a loss: half, a quarter, or how much?

It doesn’t have to be exactly Bulgarian; any other Cyrillic language would do. It is important for me to understand this at least roughly. Thank you!

Not sure I 100% understand the question, but if you are looking for an estimate of the number of tokens for a given number of words in Bulgarian, you can get one via OpenAI’s tokenizer UI here or programmatically via the OpenAI tiktoken tokenizer (details available here).

Using these tools should give you an idea of the word-to-token ratio.
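
To make that concrete, here is a minimal sketch of how the ratio could be turned into a "words per million tokens" figure. The helper names `words_per_token_budget` and `make_tiktoken_counter` are made up for this example, and `cl100k_base` is only an assumed encoding; use whichever encoding matches the model you are actually paying for. Words are counted by splitting on whitespace, which is a rough approximation.

```python
def words_per_token_budget(sample_text, count_tokens, token_budget=1_000_000):
    """Estimate how many words of text like `sample_text` fit in `token_budget` tokens.

    `count_tokens` is any callable that returns the token count of a string.
    """
    words = len(sample_text.split())   # rough word count: whitespace-separated chunks
    tokens = count_tokens(sample_text)
    if words == 0 or tokens == 0:
        raise ValueError("sample text must contain at least one word and one token")
    # budget / (tokens per word), kept in integer arithmetic and rounded down
    return token_budget * words // tokens


def make_tiktoken_counter(encoding_name="cl100k_base"):
    """Build a token counter backed by tiktoken (encoding name is an assumption)."""
    import tiktoken  # imported here so the arithmetic above works without tiktoken

    encoding = tiktoken.get_encoding(encoding_name)
    return lambda text: len(encoding.encode(text))
```

With tiktoken installed, you would paste a representative chunk of your own Bulgarian text into `sample` and call `words_per_token_budget(sample, make_tiktoken_counter())`. The longer and more typical the sample, the more trustworthy the estimate; a single sentence can be quite misleading.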

Thank you! Yes, I meant how many Bulgarian words I can “buy” with a million tokens. I have seen this tool, but I am not sure that it works correctly. As a test, I pasted in this text from the Pricing page:

"Multiple models, each with different capabilities and price points. Prices can be viewed in units of either per 1M or 1K tokens. You can think of tokens as pieces of words, where 1,000 tokens is about 750 words. This paragraph is 35 tokens."

As you can see from the screenshot, it shows 52 tokens, not 35 (and I haven’t even included the last sentence about tokens yet)…

Have a look at this post that discusses some of the reasons for inconsistencies: