How do I properly understand training data versus validation data? I have a bot I’m creating right now that returns 50+ different metrics via a JSON object; what would be the best way to fine-tune this against validation data? As far as I understand, validation data would be what the training data is scored against, but does that mean the training data can be random? I am clearly a little confused, please help!
This is machine learning 101.
While it’s not strictly necessary for you to have a solid understanding of the fundamentals of machine learning in order to proceed, you’ll almost certainly get better results if you take the time to learn a little bit about what it is you’re trying to do.
Probably the most canonical example for this sort of thing is to build a handwritten number classifier on the MNIST dataset, so I’d suggest starting there.
I’m sorry I offended you.
I think you do give valuable responses. I was simply suggesting that maybe you answer in a more basic way; there is a simple way to define the difference I requested.
I didn’t take that class. Isn’t there a more direct way to describe the difference between training and validation data?
When training a model, the primary thing usually wanted from the resulting fine-tune is accurate assessment of input data. One of the problems with training a neural network is “overfitting”: this is where the model begins to learn only the training data and the correct responses, i.e. the model knows the answers only to the specific questions being asked and has a poor ability to assess novel input.
To combat this, a small section of the training data is hidden away and used as validation data. At each epoch of the model’s training evolution, you can test the model with the validation (unseen) data and get an idea of how well the model is able to deal with new, never-before-seen input while still improving on the training set. The happy balance point is the crossover just before the model starts to overfit: at that point you get maximum “comprehension” and minimum overfitting, which gives you the best performance for that training set.
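To make that concrete, here is a minimal sketch of that epoch-by-epoch check. It uses scikit-learn’s SGDClassifier and the bundled digits dataset purely as stand-ins; the same pattern applies to whatever model you are actually training:

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)

# Hide a slice of the data away as the validation set.
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

model = SGDClassifier(random_state=0)
best_val_acc, epochs_since_best = 0.0, 0

for epoch in range(100):
    # One pass over the training data.
    model.partial_fit(X_train, y_train, classes=np.unique(y))
    # Score on data the model has never trained on.
    val_acc = model.score(X_val, y_val)
    if val_acc > best_val_acc:
        best_val_acc, epochs_since_best = val_acc, 0
    else:
        epochs_since_best += 1
    # When validation accuracy stops improving, the model is likely
    # beginning to overfit the training set, so stop here.
    if epochs_since_best >= 5:
        print(f"stopping at epoch {epoch}, best validation accuracy {best_val_acc:.3f}")
        break
```

The patience of 5 epochs is arbitrary; the point is that the decision to stop is made on the unseen data, never on the training score.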
Thank you, this was an excellent response, and very clear and concise.
I’m not offended. The purpose of my original post was to simply illustrate that understanding the purpose of training, testing, and validation data is a critical and fundamental aspect of all statistical learning techniques.
When I wrote,
This is machine learning 101
I wasn’t referring to any particular course so much as indicating that the concept is so absolutely foundational that I feel it would be reckless to proceed without understanding it.
For a clear and concise, broad-strokes understanding, we can refer to The Elements of Statistical Learning by Hastie et al. (PDF).
Concisely,
When training a model we have two goals,
- Model selection
- Model assessment
and we use data split into three sets to help us accomplish these goals,
- The training set is used to fit the models;
- The validation set is used to estimate prediction error for model selection;
- The test set is used for assessment of the generalization error of the final chosen model.
Context: (Chapter 7, Section 2, p. 222)

It is important to note that there are in fact two separate goals that we might have in mind:

- Model selection: estimating the performance of different models in order to choose the best one.
- Model assessment: having chosen a final model, estimating its prediction error (generalization error) on new data.

If we are in a data-rich situation, the best approach for both problems is to randomly divide the dataset into three parts: a training set, a validation set, and a test set. The training set is used to fit the models; the validation set is used to estimate prediction error for model selection; the test set is used for assessment of the generalization error of the final chosen model.

Ideally, the test set should be kept in a “vault,” and be brought out only at the end of the data analysis. Suppose instead that we use the test-set repeatedly, choosing the model with smallest test-set error. Then the test set error of the final chosen model will underestimate the true test error, sometimes substantially.

It is difficult to give a general rule on how to choose the number of observations in each of the three parts, as this depends on the signal-to-noise ratio in the data and the training sample size. A typical split might be 50% for training, and 25% each for validation and testing.

The methods in this chapter are designed for situations where there is insufficient data to split it into three parts. Again it is too difficult to give a general rule on how much training data is enough; among other things, this depends on the signal-to-noise ratio of the underlying function, and the complexity of the models being fit to the data.
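In code, that 50/25/25 division is just two successive splits. Here is a minimal sketch with scikit-learn, using a synthetic dataset as a stand-in for your own data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)

# First carve off the 50% training set,
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)

# then split the remaining half evenly: 25% validation, 25% test.
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# The test set now goes "in the vault" until the final model is chosen.
```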
A 30-second Crash Course
In machine learning and statistical modeling, a dataset is often partitioned into three disjoint subsets:
- Training Set: Used to train the model parameters.
- Validation Set: Used to tune model hyperparameters and select among different models.
- Test Set: Used to evaluate the final model’s generalization performance.
Simplified overview
The training set helps the model learn the underlying patterns in the data. The validation set helps to fine-tune the model and choose the best model among several candidates, without being too optimistic or pessimistic. The test set gives a final evaluation, estimating how the model will perform on unseen data.
Examples
Imagine you’re a coach training a soccer team.
- Training: You first teach your players the basic skills and plays (Training set).
- Validation: Then you have practice matches to understand how well they’re learning and adjust your coaching methods (Validation set).
- Testing: Finally, you send them to a tournament where they face unknown teams, evaluating their true skill level (Test set).
Or, suppose you want to predict whether an email is spam or not.
- Train your model on a subset of emails to understand the features of spam emails (Training set).
- Use another subset to adjust your model’s hyperparameters like the regularization term (Validation set).
- Finally, test your model on a completely new subset of emails to see how well it generalizes (Test set).
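A minimal sketch of that spam workflow, with the feature extraction elided and a synthetic dataset standing in for the featurized emails:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Stand-in for featurized emails: X are features, y is spam / not-spam.
X, y = make_classification(n_samples=2000, random_state=0)
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.5, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

# Fit one model per candidate regularization strength on the training set;
# the validation set decides which one to keep.
best_C, best_acc = None, 0.0
for C in [0.01, 0.1, 1.0, 10.0]:
    model = LogisticRegression(C=C, max_iter=1000).fit(X_train, y_train)
    acc = model.score(X_val, y_val)
    if acc > best_acc:
        best_C, best_acc = C, acc

# Only now, with the model selected, touch the test set once for a final score.
final = LogisticRegression(C=best_C, max_iter=1000).fit(X_train, y_train)
print(f"chosen C={best_C}, generalization estimate: {final.score(X_test, y_test):.3f}")
```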
Benefits of Including a Validation Set
Including a validation set provides an unbiased evaluation of a model fit during the training phase and allows for model selection and tuning of model hyperparameters.
By using a validation set, you can optimize your model and select the best version of it without affecting the final test set performance.
In cooking, the validation set is like tasting the dish as you cook. It guides you on when to stop cooking or if you need to add more spices, without affecting the final tasting session (test set).
You can experiment with different learning rates, and the validation set helps you pick the best one.
Selecting a Validation Set
The validation set should be a random, representative sample of the data that is mutually exclusive from the training and test sets.
Randomly pick some samples not in your training set but also make sure they are representative of your entire dataset.
Think of it as forming a quiz from the chapters of a book for a student who’s read most but not all of it. The quiz should cover the breadth of the material but not include questions from the chapters she hasn’t yet studied.
Use techniques like k-fold cross-validation to make sure that your validation set is both random and representative.
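For example, here is a sketch of 5-fold cross-validation with scikit-learn; each fold takes one turn as the validation set, so every sample is validated against exactly once:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, random_state=0)

# cv=5 splits the data into five folds; each model trains on four folds
# and is scored on the held-out fifth.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print(f"validation accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```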
Data Requirements and Proportions
The amount of data required for training, validation, and testing is dependent on the complexity of the model, the signal-to-noise ratio, and the inherent variability in the data.
More complex models and noisy data might require more samples.
For example, if you’re assembling a complex puzzle, you’ll want to see more pieces upfront to understand how they fit together.
Use methods like learning curves to gauge how your model’s performance changes with the size of the training set.
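Here is a minimal sketch of a learning curve with scikit-learn, again on a synthetic stand-in dataset; if the validation score is still climbing at the largest training size, more data would probably help:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=2000, random_state=0)

# Score the model at increasing training-set sizes, cross-validating each.
sizes, train_scores, val_scores = learning_curve(
    LogisticRegression(max_iter=1000), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5,
)
for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"n={n:4d}  train={tr:.3f}  validation={va:.3f}")
```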
The purpose of splitting data into training, validation, and test sets is to manage overfitting, provide an unbiased way of selecting models, and give an honest assessment of the model’s capabilities. Your goal is to make these subsets representative, mutually exclusive, and appropriately sized for your specific problem, which is, unfortunately, non-trivial even in the best of situations.
Final Thoughts
It’s one thing to “know” the purpose of the training, validation, and test sets, or even some general guidance about how to select them. It’s something else altogether to be able to do so effectively. This is why I recommended taking a step back and playing with just about everyone’s first neural network: the MNIST handwritten digit classification problem.
- There is plenty of data so it is possible to experiment with different train/validation/test splits.
- It has been literally done to death so there is an inexhaustible supply of fully worked examples in any language you want to use available to get you started.
- The networks built for MNIST classification are generally small enough to run on a CPU, so it is essentially free to play with.
- No amount of book learning or explanation can replace experience when it comes to developing real intuition for the interplay between these sets.
Look, at the end of the day you are free to do whatever you want, and there is merit in learning these concepts in a context that is exciting and relevant to you, like learning the guitar by learning to play your favorite song. But, while there are of course exceptions to the rule, most people will ultimately become better musicians if they take the time to practice some scales and learn some theory along the way.
So, while you don’t need to take an upper-division undergraduate class in machine learning in order to train a fine-tuned model, if you want the best results possible and don’t want to burn a bunch of cash getting there, I really, truly believe slowing down a bit and taking some time to learn and ensure you understand some of the fundamentals is the best way forward.
Apologies if my initial post read as dismissive; I don’t always have time to write posts as comprehensively as I would like. In my mind, my first post in this topic is asymptotically abridged shorthand for this post.
So first of all, this is an incredible answer, and I would never expect an answer this in-depth, so I seriously appreciate you taking the time to write all that.
I have been a musician all my life; I’m actually a rather advanced jazz musician, so I understand this fully. Learning the “boring” stuff ultimately enabled me to have way, way more fun in the future.
I agree, and I do want to. Right now, I am purely a JavaScript full-stack developer as far as my technical skills go. I’m really excited to dive into this stuff and I’d like to take a course on it. Is there anything like a Udemy course I could go through? I tend to really excel at those self-paced courses.
It’s cool.
Ultimately my problem is that I’m now running a company that’s launching two products, and it is really hard to find the time to do this stuff. To be frank, it sounds like I may have to hire someone for this fine-tuning portion and see what they can do. How long do you think it takes to get good at fine-tuning models? I understand what you’re saying, that I can learn by doing, but I’d learn more by learning some theory first and then doing…
But assuming I don’t have much time to educate myself: are there short courses (short as in 20-30 hours) that can teach me the fundamentals?