Fine-tune tokens lower than expected

tiberiu · December 7, 2023, 7:46pm

I ran a fine-tune on babbage-002 for 1 epoch. I calculated the dataset size to be just over 25M tokens. When the fine-tune finished, the model had been trained on just over 19.5M tokens. Running the fine-tune on the Azure OpenAI API yields an even lower token count (just over 15M). Why the discrepancy?

jmportilla · December 7, 2023, 11:49pm

Which encoding did you use with tiktoken? babbage is older, so I don’t think it’s on the same encoding as gpt-3.5-turbo; it’s either p50k_base or r50k_base. One other anecdote: I remember trying to use tiktoken in the past to estimate pricing/tokens, and its estimate was also a bit off compared with what it actually cost me.

_j · December 8, 2023, 1:09am

Rather, babbage-002 and davinci-002 are the current-generation replacement base models suitable for fine-tuning. They replace the prior four GPT-3 base models whose names are simply the bare name (ada, babbage, curie, davinci).

Both use the cl100k_base token encoder. Tokens per example can be calculated simply as the token count of the prompt string plus the token count of the completion string (the two should not be combined into one string, if OpenAI is doing fine-tuning correctly).
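The per-example count described above can be sketched roughly as follows. This is only an illustration, assuming a prompt/completion JSONL training file; the function names and the file name `train.jsonl` are hypothetical, and the encoder is passed in as a callable so the counting logic does not depend on any one tokenizer.

```python
import json


def count_finetune_tokens(jsonl_lines, encode):
    """Sum prompt and completion token counts, encoding each field separately.

    `jsonl_lines` is any iterable of JSONL strings (for example an open file).
    `encode` is any callable that maps a string to a sequence of tokens.
    """
    total = 0
    for line in jsonl_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Prompt and completion are tokenized as two separate strings.
        total += len(encode(record["prompt"])) + len(encode(record["completion"]))
    return total


def tiktoken_encoder(model="babbage-002"):
    """Return the tiktoken encode function for `model` (hypothetical helper)."""
    import tiktoken  # imported lazily so the counter itself has no dependency

    return tiktoken.encoding_for_model(model).encode


# Example usage (assumes tiktoken is installed and train.jsonl exists):
# with open("train.jsonl", encoding="utf-8") as f:
#     print(count_finetune_tokens(f, tiktoken_encoder("babbage-002")))
```

Multiply the result by the number of epochs to compare it with the trained-token figure the job reports.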

A discrepancy between OpenAI and Azure on the same training file cannot be explained if both jobs ran the same number of fine-tune epochs, whether you specified that number or it was chosen for you. Has one perhaps billed for the validation file while the other has not?

tiberiu · December 8, 2023, 12:36pm

I did not use a validation dataset on either one, and indeed I calculated the number of tokens without combining the prompt and completion.

tiberiu · December 8, 2023, 12:45pm

I found the mistake, thanks both for the help. I used
tiktoken.encoding_for_model("babbage")
instead of
tiktoken.encoding_for_model("babbage-002")
which gives the r50k_base tokenizer instead of cl100k_base.
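A quick way to catch this kind of mix-up is to print the encoding name that tiktoken resolves for each model string before counting anything. The sketch below is a minimal illustration; the helper names are hypothetical, and the default lookup assumes tiktoken is installed (it relies on `tiktoken.encoding_for_model` and the `name` attribute of the returned encoding).

```python
def encoding_names(models, lookup=None):
    """Map each model name to the name of the encoding it resolves to.

    `lookup` is a callable taking a model name and returning an encoding name.
    By default it asks tiktoken; a stub can be passed in for testing.
    """
    if lookup is None:
        import tiktoken  # imported lazily; only needed for the default lookup

        def lookup(model):
            return tiktoken.encoding_for_model(model).name

    return {model: lookup(model) for model in models}


def encodings_differ(names):
    """Return True if the model names resolve to more than one encoding."""
    return len(set(names.values())) > 1


# Example usage (assumes tiktoken is installed):
# names = encoding_names(["babbage", "babbage-002"])
# print(names, encodings_differ(names))
```

If the two names resolve to different encodings, as reported in this thread, token counts computed with one will not match a job run on the other.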
