I ran a fine-tune of babbage-002 for 1 epoch. I calculated the dataset size to be close to (just over) 25M tokens, but when the fine-tune finished, the model had been trained on just over 19.5M tokens. Running the same fine-tune on the Azure OpenAI API yields an even lower token count (just over 15M). Why the discrepancy?
Which tiktoken encoding did you use? babbage is older, so I don't think it's on the same encoding as gpt-3.5-turbo; it's either p50k_base or r50k_base. One other anecdote: I remember trying to use tiktoken in the past to estimate pricing/tokens, and its estimate was also a bit off compared to what I was actually charged.
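If it helps, a quick way to check is to ask tiktoken which encoding it maps a given model name to. This is just a sketch, assuming a reasonably recent tiktoken install:

```python
import tiktoken

# Print the encoding tiktoken associates with each model name
for model in ("babbage", "babbage-002", "gpt-3.5-turbo"):
    enc = tiktoken.encoding_for_model(model)
    print(f"{model} -> {enc.name}")
```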
Rather, babbage-002 and davinci-002 are the current-generation replacement base models suitable for fine-tuning. They replace the prior four GPT-3 base models that went by bare names. Both use the cl100k_base token encoder. Tokens per example can be calculated simply as the token count of the prompt string plus the token count of the completion string (they should not be combined into one string if the fine-tuning is being done correctly by OpenAI).
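As a rough sketch of that counting, assuming a legacy prompt/completion-style JSONL training file (the file name train.jsonl is just a placeholder):

```python
import json
import tiktoken

# babbage-002 and davinci-002 both use the cl100k_base encoder
enc = tiktoken.get_encoding("cl100k_base")

total_tokens = 0
with open("train.jsonl", encoding="utf-8") as f:  # placeholder file name
    for line in f:
        example = json.loads(line)
        # Count prompt and completion separately; do not concatenate them
        total_tokens += len(enc.encode(example["prompt"]))
        total_tokens += len(enc.encode(example["completion"]))

print(f"Estimated training tokens for 1 epoch: {total_tokens}")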
A discrepancy between OpenAI and Azure can't be explained by the same training file if the same number of fine-tune epochs was specified or performed on both. Has one perhaps billed for the validation file while the other has not?
I did not use a validation dataset on either one, and indeed I calculated the number of tokens without combining the prompt and completion.
I found the mistake, thanks both for the help. I used tiktoken.encoding_for_model("babbage") instead of tiktoken.encoding_for_model("babbage-002"), which gives the r50k_base tokenizer instead of cl100k_base.
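For anyone who hits the same thing, here's a minimal check that shows the difference (the sample text below is just made up):

```python
import tiktoken

sample = "A short sample completion to compare tokenizers."  # made-up text

old_enc = tiktoken.encoding_for_model("babbage")      # resolves to r50k_base
new_enc = tiktoken.encoding_for_model("babbage-002")  # resolves to cl100k_base

print(old_enc.name, len(old_enc.encode(sample)))
print(new_enc.name, len(new_enc.encode(sample)))
```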