I’ve been creating 51,233 example files based on usage of my own DSL. The DSL is intended to be a secure AI language: administrators can decide on a per-function level whether something is legally allowed, and it runs in a virtual file system without access to anything outside its sandbox environment. Think “safe OpenClaw”, if that helps …
Anyway, those details are largely irrelevant here. What I need help with is “advanced fine-tuning”.
I’m fine-tuning GPT-4.1-mini with the following hyperparameters:
- 3 epochs
- 32 batch size
- 0.9 LR multiplier
I’m doing “bucketed” batch building, where I divide examples into 5 buckets:
- Less than 60 tokens
- From 60 to 120 tokens
- From 120 to 200 tokens
- From 200 to 600 tokens
- Anything above 600
Then I pick from the above buckets in order: 12, 7, 4, 7, 2 (which sums to my batch size of 32). Each batch starts with 12 small examples, then 7 from bucket 2, and so on, cycling back around when a bucket runs out. The result is that smaller examples are provided first, then incrementally larger ones, until it rolls back on itself.
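To make the scheme concrete, here’s a minimal Python sketch of the bucketed batch building described above (the bucket boundaries and pick counts are the ones I listed; `token_count` is a stand-in for however you count an example’s tokens, and the sketch assumes every bucket is non-empty):

```python
from itertools import cycle

# Bucket upper bounds (in tokens) and per-batch pick counts.
BOUNDS = [60, 120, 200, 600, float("inf")]
PICKS = [12, 7, 4, 7, 2]  # sums to the batch size of 32

def bucket_index(n_tokens):
    """Return the index of the bucket an example of n_tokens falls into."""
    for i, upper in enumerate(BOUNDS):
        if n_tokens < upper:
            return i
    return len(BOUNDS) - 1

def build_batches(examples, token_count):
    """Yield batches of 32, taking 12/7/4/7/2 examples from the five
    buckets in order, cycling within each bucket when it is exhausted.
    Assumes every bucket contains at least one example."""
    buckets = [[] for _ in BOUNDS]
    for ex in examples:
        buckets[bucket_index(token_count(ex))].append(ex)
    iters = [cycle(b) for b in buckets]  # "rolls around and starts again"
    n_batches = len(examples) // sum(PICKS)
    for _ in range(n_batches):
        batch = []
        for it, k in zip(iters, PICKS):
            batch.extend(next(it) for _ in range(k))
        yield batch
```

So a batch always opens with the 12 shortest-bucket examples and closes with the 2 longest-bucket ones.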
I’m using roughly 1,000 of the above examples, chosen at random, as validation data, and across my epochs I got the following full validation loss:
- ~0.045
- ~0.030
- ~0.025
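For reference, carving out the ~1,000 random validation examples can be sketched like this (seeded for reproducibility; `random.sample` is just one way to do the split):

```python
import random

def split_validation(examples, n_val=1000, seed=42):
    """Randomly hold out n_val examples for validation,
    returning (train, validation) lists."""
    rng = random.Random(seed)
    held_out = set(rng.sample(range(len(examples)), n_val))
    train = [ex for i, ex in enumerate(examples) if i not in held_out]
    val = [ex for i, ex in enumerate(examples) if i in held_out]
    return train, val
```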
The result is truly magnificent: the model can solve all sorts of problems in its domain, and it is generalising fairly well. My training examples have been prompt-augmented, but no code example appears more than 5 times, so the training material is really diverse and facilitates a high degree of generalisation.
However, how do I bring it to “the next level” …?
And, am I doing it right?
Oh yeah, my bucket distribution is as follows:
- 19,103
- 9,938
- 6,717
- 11,312
- 3,232
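If anyone wants to double-check a distribution like this on their own data, here’s a small sketch that counts examples per bucket given any token-counting callable (a tiktoken encoder would be one choice for `token_count`; which encoding GPT-4.1-mini actually uses is something to verify separately):

```python
from collections import Counter

# Same bucket upper bounds as above (in tokens).
BOUNDS = [60, 120, 200, 600, float("inf")]

def bucket_distribution(texts, token_count):
    """Count how many examples fall into each token-length bucket.
    token_count is any callable mapping an example to its token count."""
    dist = Counter()
    for text in texts:
        n = token_count(text)
        for i, upper in enumerate(BOUNDS):
            if n < upper:
                dist[i] += 1
                break
    return [dist[i] for i in range(len(BOUNDS))]
```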