I ran a fine-tune on babbage-002 for 1 epoch. I calculated the dataset size to be just over 25M tokens. When the fine-tune finished, the model had been trained on just over 19.5M tokens. Running the same fine-tune through the Azure OpenAI API yields an even lower token count (just over 15M). Why the discre…

Fine-tune tokens lower than expected

tiberiu December 8, 2023, 12:45pm 5

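I found the mistake; thanks to both of you for the help. To estimate the dataset size I used
tiktoken.encoding_for_model("babbage")
instead of
tiktoken.encoding_for_model("babbage-002")
The first call returns the r50k_base tokenizer, whereas babbage-002 uses cl100k_base, so my estimate of just over 25M tokens was too high.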
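
For anyone hitting the same mismatch, here is a rough sketch of how the two counts could be compared. It assumes the training file is JSONL with "prompt" and "completion" fields and uses a hypothetical path train.jsonl; neither detail comes from this thread, so adjust to your own data. The helper names count_dataset_tokens and compare_encodings are made up for illustration. The number the fine-tuning job reports may still differ somewhat from a local count.

```python
import json


def count_dataset_tokens(jsonl_lines, encode):
    """Sum token counts over the prompt and completion of each JSONL record.

    `encode` is any callable mapping a string to a sequence of tokens,
    e.g. the `encode` method of a tiktoken encoding.
    """
    total = 0
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Assumption: each record has "prompt" and "completion" keys.
        total += len(encode(record["prompt"])) + len(encode(record["completion"]))
    return total


def compare_encodings(jsonl_lines, model_names=("babbage", "babbage-002")):
    """Return {model_name: (encoding_name, token_count)} for each model name."""
    import tiktoken  # third-party: pip install tiktoken

    lines = list(jsonl_lines)  # materialise so the data can be read twice
    results = {}
    for name in model_names:
        enc = tiktoken.encoding_for_model(name)
        results[name] = (enc.name, count_dataset_tokens(lines, enc.encode))
    return results


if __name__ == "__main__":
    # Hypothetical file name; replace with your own training file.
    with open("train.jsonl", encoding="utf-8") as f:
        for model, (encoding, tokens) in compare_encodings(f).items():
            print(f"{model}: {encoding} -> {tokens} tokens")
```

Printing the encoding name next to each count makes it obvious when a model string resolves to a different tokenizer than the one you expected.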
