Discrepancy in Token Count During Fine-Tuning Job Creation

When I checked my dataset’s token count using the OpenAI tokenizer, it showed 600,000 tokens. However, during the fine-tuning job creation on the OpenAI platform, it reached over 1 million tokens. My dataset remains the same, so what could cause the token count to vary during the fine-tuning job creation?

Here’s what you might be noticing:

When fine-tuning, training is performed in multiple passes through the same data, and you are charged for the tokens consumed in each pass.

This training hyperparameter is called “epochs”, and it is set automatically based on the size of your example JSONL file if you do not specify it yourself.
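A minimal sketch of the billing arithmetic, using the 600,000-token figure from the question and a hypothetical auto-selected epoch count (the actual value chosen by the platform depends on your dataset):

```python
# Billed training tokens scale with the number of epochs, since each
# epoch is a full pass over the dataset.
dataset_tokens = 600_000   # tokens counted once through the JSONL
n_epochs = 2               # hypothetical auto-selected epoch count

billed_tokens = dataset_tokens * n_epochs
print(billed_tokens)       # 1_200_000 -- already over 1 million
```

With just two passes, a 600k-token dataset exceeds 1 million billed tokens, which matches what you observed at job creation.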


Results displayed in the UI, where the whole JSONL fed into a tokenizer comes to under 6,000 tokens:

- Trained tokens: 43,776
- Epochs: 9
- Batch size: 1
- LR multiplier: 2
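The UI figures above are self-consistent, as a quick check shows (numbers taken directly from the displayed results):

```python
# Verify that trained tokens divided by epochs gives the per-pass
# dataset size, which should be under the ~6,000 tokens counted.
trained_tokens = 43_776
epochs = 9

per_epoch = trained_tokens / epochs
print(per_epoch)  # 4864.0 tokens per pass, under 6,000 as expected
```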