Arabic Texts Tokenization

zainab.althafir · June 7, 2023, 8:08am

Hello, I’m using OpenAI models for Arabic reading comprehension tasks, but I encountered a problem when using the cl100k tokenizer; the models didn’t reply to any of my queries; meanwhile, when I use the gpt2 tokenizer, it works as expected. Does anyone know why this may happen with Arabic texts?

Topic		Replies	Views
What cause the rubbish data output from GPT4 API calls response in arabic? Community gpt-4 , api	4	823	June 2, 2024
Right to Left languages token count Prompting	1	935	March 18, 2023
Gpt-4o-mini responses are being cut off Community gpt-4 , gpt-4o-mini	1	202	January 28, 2025
Struggling to get correct token count Community gpt-4 , gpt-35-turbo , api	2	1893	September 4, 2023
GPT-4 API Outputs Gibberish When Prompted in Arabic Community api	0	446	May 7, 2024

Arabic Texts Tokenization

Related topics