When I checked my dataset’s token count using the OpenAI tokenizer, it showed 600,000 tokens. However, when I created a fine-tuning job on the OpenAI platform, the token count came to over 1 million. My dataset is the same, so what could cause the token count to differ during fine-tuning job creation?
Here’s what you might be noticing:
Fine-tuning trains on your data in multiple passes, and you are billed for the tokens in each pass.
The number of passes is a training hyperparameter called “epochs”, and it is set automatically based on the size of your example JSONL file if you do not specify it yourself.
Here are the results displayed in the UI for a job where the whole JSONL, fed into a tokenizer, comes to under 6,000 tokens:
Trained tokens: 43,776
Epochs: 9
Batch size: 1
LR multiplier: 2
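You can reproduce this arithmetic yourself before submitting a job: billed trained tokens ≈ dataset tokens × epochs (above, 43,776 ÷ 9 ≈ 4,864 tokens per pass, consistent with an under-6,000-token file). Below is a minimal sketch, assuming a standard chat-format JSONL and the cl100k_base encoding via tiktoken; the platform’s own count can differ slightly because of per-message formatting overhead, but the epoch multiplier is the dominant effect.

```python
import json
import tiktoken  # pip install tiktoken

def estimate_trained_tokens(jsonl_path: str, n_epochs: int) -> int:
    """Rough estimate of billed training tokens: dataset tokens x epochs."""
    enc = tiktoken.get_encoding("cl100k_base")
    dataset_tokens = 0
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            example = json.loads(line)
            # Assumes the chat fine-tuning format:
            # {"messages": [{"role": "...", "content": "..."}, ...]}
            for message in example.get("messages", []):
                dataset_tokens += len(enc.encode(message.get("content") or ""))
    return dataset_tokens * n_epochs

# e.g. a ~5,000-token file trained for 9 epochs bills roughly 45,000 tokens
print(estimate_trained_tokens("train.jsonl", n_epochs=9))
```

If the estimate is higher than you expect, you can pass a smaller number of epochs explicitly when creating the job instead of letting it be auto-selected.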