When I checked my dataset’s token count using the OpenAI tokenizer, it showed 600,000 tokens. However, when I created a fine-tuning job on the OpenAI platform, the token count came to over 1 million. My dataset is the same, so what could cause the token count to differ during fine-tuning job creation?
Here’s what you might be noticing:
Fine-tuning trains on your data in multiple passes, and you are billed for the tokens in each pass.
The number of passes is a training hyperparameter called “epochs”, and it is set automatically based on the size of your example JSONL file if you do not specify it yourself.
Here are the results displayed in the UI for a job where the whole JSONL, fed into a tokenizer, comes to under 6,000 tokens:
Trained tokens: 43,776
Epochs: 9
Batch size: 1
LR multiplier: 2
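You can reproduce this arithmetic yourself before submitting a job: billed trained tokens ≈ dataset tokens × epochs (above, 43,776 ÷ 9 ≈ 4,864 tokens per pass, consistent with an under-6,000-token file). Below is a minimal sketch, assuming a standard chat-format JSONL and the cl100k_base encoding via tiktoken; the platform’s own count can differ slightly because of per-message formatting overhead, but the epoch multiplier is the dominant effect.

```python
import json
import tiktoken  # pip install tiktoken

def estimate_trained_tokens(jsonl_path: str, n_epochs: int) -> int:
    """Rough estimate of billed training tokens: dataset tokens x epochs."""
    enc = tiktoken.get_encoding("cl100k_base")
    dataset_tokens = 0
    with open(jsonl_path, encoding="utf-8") as f:
        for line in f:
            example = json.loads(line)
            # Assumes the chat fine-tuning format:
            # {"messages": [{"role": "...", "content": "..."}, ...]}
            for message in example.get("messages", []):
                dataset_tokens += len(enc.encode(message.get("content") or ""))
    return dataset_tokens * n_epochs

# e.g. a ~5,000-token file trained for 9 epochs bills roughly 45,000 tokens
print(estimate_trained_tokens("train.jsonl", n_epochs=9))
```

If the estimate is higher than you expect, you can pass a smaller number of epochs explicitly when creating the job instead of letting it be auto-selected.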