Azure ML Thursday 2: Train, Test, Submit!
On this second Azure ML Thursday, I'll discuss a first entry on a competition. Also, some background about splits and cross-validation. Microsoft has provided a walkthrough for your first entry, so I won't describe all the steps you'll need to take. Rather, I'll provide some first, easy tweaks to the first submission.
Choosing train / test sets
One thing that holds for all ML projects, regardless of the algorithm used or the kind of data, is the need for training and test data. Machine Learning algorithms have to be precise as well as reliable. Now what do I mean by that?
Suppose I create a Machine Learning model that predicts the weight of newborn babies. A model that "predicts" the weight of a newborn using the social security number can be very precise - at least on the data it's been trained with. For every single entry in my training data, it will return exactly the right weight. 100% match - now that's precise!
However, as soon as I try to apply this model to new data, it turns out the model is not reliable: because it has never seen the new entries, it has no way to predict the weight of a baby - not unless there's a strong relationship between social security number and weight.
The phenomenon of a model being fitted too closely to the data it's been trained with is called overfitting. Contrary to the example given above, overfitting isn't always obvious. Therefore, when training a model we split our available data into two sets: train data and test data (or verification data). Because the test data is never used in training the model, it provides insight into the reliability of the model.
Usually, the split ratio is between 60% and 85%, depending on the dataset size and how likely a particular algorithm is to overfit.
Now let's say we're studying pizza sales. We're looking at a small pizza company with 5 types of pizza, each sold exactly 20 times (totaling 100 pizzas):
- Due Olive
- Tre Pezzi
- Quattro Stagioni
- Cinque Formaggi
For a testing set, we pick 20% of the rows (20 rows) randomly from the total dataset. It is theoretically possible that all 20 rows of one pizza type end up in the test set, leaving only four types of pizza in the training set and the fifth exclusively in the test set! That's definitely not what we meant when splitting the data...
Whenever you want to make sure that the proportions of data are the same in the train and test set, you do a stratified split. So you say to your splitter: "Okay, listen up. I want to have 80% train data, 20% test data. And make sure that the relative number of pizza types is the same inside both sets!".
Given those three parameters:
- The original dataset
- The split ratio (20% for validation)
- The stratifier ("pizza types")
a stratified split is made.
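Outside Azure ML Studio, the same idea can be sketched with scikit-learn's `train_test_split`, which accepts exactly these three parameters: the dataset, the split ratio, and a `stratify` column. The pizza labels below are hypothetical stand-ins (the exact names don't matter for the mechanics):

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical pizza dataset: 5 types, each sold exactly 20 times (100 rows).
pizza_types = [f"type_{i}" for i in range(1, 6) for _ in range(20)]
rows = list(range(100))  # stand-in for the actual feature rows

# Stratified 80/20 split: every pizza type keeps the same proportion
# in the train set and the test set.
train_rows, test_rows, train_y, test_y = train_test_split(
    rows, pizza_types, test_size=0.2, stratify=pizza_types, random_state=42
)

print(Counter(test_y))  # each of the 5 types appears exactly 4 times (4/20 = 20%)
```

Without `stratify`, the 20 test rows are drawn purely at random, and an unlucky draw can leave a pizza type over- or underrepresented.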
Train / test sets in Azure ML Studio
Inside Azure ML Studio, the Split Data block (selected below) is used to create train / test sets. On the right side is the properties pane, on which I've highlighted the settings for stratified splitting:
Notice the trained model is scored twice. The train and test sets (numbered '1' and '2' respectively beneath the 'Split Data' component) are both used to score the model inside the two 'Score Model' blocks. Then, the results are compared inside the 'Evaluate Model' block. When visualizing the outcome of 'Evaluate Model', it's easy to see how the trained model performs on the train set as well as the test set:
When choosing a train and test set, you implicitly introduce a new bias: it could be that the model you just trained happens to predict well for this particular test set when trained on this particular train set. To reduce this bias, you can "cross-validate" your results.
Cross-validation (often abbreviated as just "cv") splits the dataset into n folds. Each fold is used once as a testset, using all other folds together as a training set. So in our pizza example with 100 records, with 5 folds we will have 5 test runs:
- rows 1-20 = testset; rows 21-100 = train set
- rows 21-40 = testset; rows 1-20 and 41-100 = train set
- rows 41-60 = testset; rows 1-40 and 61-100 = train set
- rows 61-80 = testset; rows 1-60 and 81-100 = train set
- rows 81-100 = testset; rows 1-80 = train set
This results in a much lower chance of something working for just one specific train/test distribution.
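The five test runs above can be reproduced with scikit-learn's `KFold` (a sketch of the general mechanism, not of Azure ML Studio's internal implementation). With `shuffle=False`, the folds are contiguous blocks, exactly as in the list above:

```python
from sklearn.model_selection import KFold

rows = list(range(1, 101))  # 100 pizza records, numbered 1-100
kf = KFold(n_splits=5, shuffle=False)

# Each fold serves as the test set exactly once; the other four
# folds together form the training set for that run.
for fold, (train_idx, test_idx) in enumerate(kf.split(rows), start=1):
    test_fold = [rows[i] for i in test_idx]
    print(f"run {fold}: test set = rows {test_fold[0]}-{test_fold[-1]}, "
          f"train set = {len(train_idx)} remaining rows")
```

Averaging the five test scores gives a more stable estimate of reliability than any single train/test split.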
In Azure ML Studio, cross-validation is available as a single element, using 10 folds by default. Remember to feed it an untrained model and the full (non-split) dataset! (Homework question: why?)
If you want to use a different number of folds in Azure ML (recommended for smaller datasets like the Iris Flower example competition), remember to partition your dataset first using the 'Partition and Sample' element.
How does this influence my scores in the Azure ML Competition?
To finish, let's apply the knowledge we got here to the tutorial competition "Iris Multiclass Competition". I assume you've worked through the tutorial using the document or video, submitted your first entry, and now have the following:
- A submitted competition entry
- A place in the leaderboard (at the time of writing, default accuracy = 93.33333)
- The following experiment:
We can now adjust the 'split data' component's test set size and look at the results of the evaluation set:
| Training set size | Training set score | Test set score | Submitted score |
|-------------------|--------------------|----------------|-----------------|
As you can see, the published models are trained exactly as you trained them: on the split data. When we shrink the training set too much (row 2), the accuracy of the model declines. Once you've found a reliable model, you might benefit from enlarging the training set. You could even use 100% of the data to train the model, but beware: then you can't test the reliability any more (and technically, a model trained on different data is a different model).
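The effect of the training set size can be sketched outside Azure ML Studio as well. The snippet below uses scikit-learn's built-in Iris dataset and a decision tree (a stand-in for whatever algorithm the tutorial experiment uses); the exact scores will differ from the competition leaderboard, but the trend - test accuracy degrading as the training set shrinks - is the point:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)  # 150 rows, 3 classes

# Vary the training fraction and compare train vs. test accuracy.
for train_frac in (0.75, 0.50, 0.25):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, train_size=train_frac, stratify=y, random_state=42
    )
    model = DecisionTreeClassifier(random_state=42).fit(X_tr, y_tr)
    print(f"train fraction {train_frac:.2f}: "
          f"train accuracy {model.score(X_tr, y_tr):.3f}, "
          f"test accuracy {model.score(X_te, y_te):.3f}")
```

Note that the train accuracy of a decision tree stays (near) perfect at every size - precision without reliability, exactly the overfitting trap from the start of this post.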
That's it for today! Three points to take home and apply in your first ML endeavour:
- Know the difference between precision and reliability. Using all your data to train the model seems tempting when you're just starting out with ML, but it won't give you reliable models.
- For fairly robust models like Decision Forests, the default split of .6 or .75 is pretty conservative. Research how likely the model is to overfit, and test it. On the other hand: keep an eye on the number of rows in the testset: it should remain a good representation of the entire set.
- When you've found parameters that yield a robust, reliable trained model, see how it scales: what is the effect on reliability when you use a larger part of the data for training? At very high percentages (up to all the data) this is a dangerous but possibly profitable path: dangerous because you can no longer measure how well the model performs (all data has been used for training), and possibly profitable because the model gets to learn from all available data.