Right to Left languages token count

saied · January 24, 2023, 7:20am

I was running some experiments with davinci3 on non-English languages, especially right-to-left languages like Arabic, and I noticed that the token count is much higher than for other languages such as Turkish or German.
I already know this model uses a BPE tokenizer.
Is there any workaround for this, especially for lowering the cost of completion and fine-tuning?
Note that I don't think translation is an option, since we would lose part of the context!
Thanks in advance

amranwr1981 · March 18, 2023, 7:38pm

Hi Saied,
Did you find a solution to this issue?
I am currently facing the same problem while working with the API, and I would like the number of tokens to be equal to what the same text takes in English.

Thanks
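
For anyone wondering why the counts differ so much, here is a rough sketch, assuming the tokenizer is a byte-level BPE (as in GPT-2) that works on UTF-8 bytes. Under that assumption, the byte length of a text is an upper bound on its token count, and Arabic letters take 2 bytes each in UTF-8 while most Turkish and German letters take 1. The sample sentences and the helper name `utf8_stats` are made up for illustration; this does not run the real tokenizer.

```python
# Sketch only: compares UTF-8 byte counts per character across languages.
# Assumption: the model's BPE starts from UTF-8 bytes, so
# token count <= byte count. Exact counts require the real tokenizer.

# Hypothetical sample greetings, chosen only for illustration.
samples = {
    "English": "Hello, how are you?",
    "Turkish": "Merhaba, nasılsın?",
    "German": "Hallo, wie geht es dir?",
    "Arabic": "مرحبا، كيف حالك؟",
}


def utf8_stats(text):
    """Return (characters, UTF-8 bytes, bytes per character) for text."""
    chars = len(text)
    nbytes = len(text.encode("utf-8"))
    ratio = nbytes / chars if chars else 0.0
    return chars, nbytes, ratio


for lang, text in samples.items():
    chars, nbytes, ratio = utf8_stats(text)
    print(f"{lang:8} chars={chars:3} bytes={nbytes:3} bytes/char={ratio:.2f}")
```

This only explains the gap, it does not reduce it. The Arabic sample needs close to twice as many bytes per character as the English one, and a BPE vocabulary trained mostly on English text probably has fewer merges that cover Arabic, so fewer of those bytes get combined into a single token. To measure actual token counts, run the model's own tokenizer (for example OpenAI's tiktoken library) on your text.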
