Azure ML Thursday 6: xgboost in R

In the last Azure ML Thursdays we explored how to do Machine Learning in Python. Python in Azure ML doesn't include one particularly successful algorithm though - xgboost. Python packages are available, but not yet for Windows - which means they are also not available inside Azure ML Studio. But they are available in R! Today, we take the same approach as two weeks ago: first, we move out of Azure ML to do our first ML in R, then (next week) we'll upload and use our trained R model inside Azure ML Studio.

Today, I'll show you how to use xgboost on the still ongoing Cortana Intelligence Competition "Women's Health Risk Assessment" (WHRA). At the moment of writing, the leaderboard has stayed the same for over three weeks, with only 336 participants - but the competition ends in a week, with a grand prize of $3,000.

So rush to participate, and use the knowledge shared here to win - all code presented below can be run in order and will result in a trained model for the WHRA dataset!

ICYMI: What is R?

R is a statistical language widely used by academics and data scientists alike. R is open source, very powerful, and reminds some people of Matlab. Having originated in statistics, R has a pretty solid collection of Machine Learning libraries. One of the great advantages of R is that it's used extensively by researchers in the field of statistics and Machine Learning, which means the newest, best-performing algorithms will pretty much always be available in R. Its open source and organic character made enterprises initially somewhat hesitant to start using it, but currently more and more large vendors are backing it with enterprise-grade support.

When you're getting started with R, I highly recommend downloading and using RStudio. You can download RStudio at https://www.rstudio.com/

Machine Learning inside R

The steps of doing Machine Learning are not very different from the steps we've taken earlier in Python: it's still about transforming the dataset, splitting it into train and test sets, training the model using parameters, and scoring the model by testing it.

Prerequisites: the libraries

We'll use four libraries while working in our local R environment:

If one or more libraries are missing, add them one by one using install.packages:
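A minimal sketch of that check, assuming the libraries are the ones named elsewhere in this post: xgboost, caret, and ade4 (which provides acm.disjonctif). The fourth library is not named in the text, so it is left out here. The install line is commented out so the snippet runs without an internet connection.

```r
# Packages named in this post (assumption: the original listing may contain others).
needed <- c("xgboost", "caret", "ade4")

# requireNamespace() returns FALSE instead of failing when a package is absent.
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
to_install <- needed[!is_installed]

# Install only what is missing; uncomment to run (needs an internet connection).
# install.packages(to_install)
print(to_install)
```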

Loading the dataset

To load the dataset (the Women's Health training set of the still ongoing Cortana Intelligence Competition), we use read.csv:
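As a rough illustration of read.csv, the sketch below first writes a tiny stand-in file, so it runs without the competition data. The file contents and all column names are hypothetical; replace csv_path with the path to the downloaded training set.

```r
# Write a tiny stand-in file so the example runs without the competition data.
# All column names and values below are hypothetical.
csv_path <- tempfile(fileext = ".csv")
writeLines(c(
  "patientID,geo,segment,subgroup,religion,age",
  "1,1,1,1,Hindu,23",
  "2,2,2,1,,31",
  "3,1,2,2,Muslim,19"
), csv_path)

# stringsAsFactors = TRUE turns text columns into factors
# (this was the default before R 4.0).
whra <- read.csv(csv_path, stringsAsFactors = TRUE)
str(whra)
```

Note that an empty text field is read as the empty string "" rather than NA, which is why an "empty" category can show up later when creating dummies.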

After that, we combine the three to-be-predicted columns inside one column (for a detailed description of the dataset see this document) and remove the columns with which we would be able to identify a patient (read my earlier post about overfitting if you wonder why):
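A sketch of that step on a toy data frame. The column names (patientID, geo, segment, subgroup, combined_label) are assumptions made for illustration, as is the choice to glue the three values together as text.

```r
# Toy data; column names are hypothetical stand-ins for the real dataset.
whra <- data.frame(
  patientID = c(1, 2, 3),
  geo       = c(1, 2, 1),
  segment   = c(1, 2, 2),
  subgroup  = c(1, 1, 2),
  age       = c(23, 31, 19)
)

# Combine the three target columns into one label,
# e.g. geo 1, segment 2, subgroup 2 becomes "122".
whra$combined_label <- paste0(whra$geo, whra$segment, whra$subgroup)

# Drop the separate targets and the column that identifies a patient.
drop_cols <- c("patientID", "geo", "segment", "subgroup")
whra <- whra[, !(names(whra) %in% drop_cols), drop = FALSE]
print(whra)
```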

One-hot encoding

In my earlier post "Azure ML Thursday 4: ML in Python", I've explained what One-Hot Encoding is (and why you need it). Inside R, you usually don't use a "one-hot encoder" as in sklearn. Instead, you can use the acm.disjonctif function (from the ade4 package) to create dummies:
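A small sketch of acm.disjonctif on toy data (the column names and values are made up). It takes a data frame of categorical columns and returns one 0/1 column per level:

```r
library(ade4)  # provides acm.disjonctif

# Toy categorical data; column names and values are hypothetical.
categorical <- data.frame(
  religion = factor(c("Hindu", "", "Muslim")),
  urban    = factor(c("yes", "no", "yes"))
)

# One 0/1 column per factor level, named "<column>.<level>".
dummies <- acm.disjonctif(categorical)
print(colnames(dummies))
```

The empty string counts as a level of its own, so it gets its own dummy column.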

Basically, creating dummies this way does the same as a One-Hot Encoder in sklearn, with one exception: the One-Hot Encoder inside sklearn separates "fit" from "transform", which means it can re-apply the same transformation to new datasets. The dummies approach, on the other hand, derives the column names from the contents of the columns. This can introduce new columns when values appear in production data that weren't present in the training data, so you might need to prepare for that. In the WHRA example this isn't a problem though - the only column we'll remove here is the empty religion:
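One way to prepare for unseen values is to remember the dummy column names from training and force new data into that shape. The helper below, align_columns, is a hypothetical name and not part of any package; it is a base R sketch of the idea rather than the approach used in this post.

```r
# Hypothetical helper: give `new_data` exactly the columns seen at training time.
align_columns <- function(new_data, training_columns) {
  # Dummy columns that never occurred in the new data are added as all zeros.
  for (col in setdiff(training_columns, names(new_data))) {
    new_data[[col]] <- 0
  }
  # Columns for values unseen during training are dropped;
  # the column order follows the training set.
  new_data[, training_columns, drop = FALSE]
}

training_columns <- c("religion.Hindu", "religion.Muslim")
production <- data.frame(religion.Hindu = c(1, 0), religion.Sikh = c(0, 1))

aligned <- align_columns(production, training_columns)
print(aligned)
```

Dropping an unseen category means such a row ends up with zeros in every dummy column for that variable, which is a design choice you may or may not want.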

Splitting the data sets

The easiest way to split a dataset is the createDataPartition function from the caret library. The p argument states what fraction of the data is reserved for training:
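A sketch of such a split on toy data. The value p = 0.8 and the column names are assumptions chosen for illustration, not necessarily the settings used for the competition.

```r
library(caret)

set.seed(42)  # make the random split reproducible

# Toy data; column names are hypothetical.
toy <- data.frame(
  feature        = rnorm(100),
  combined_label = factor(rep(c("111", "121", "211", "221"), each = 25))
)

# p = 0.8 reserves about 80% of the rows for training, stratified by label.
train_idx <- createDataPartition(toy$combined_label, p = 0.8, list = FALSE)
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

cat(nrow(train), "training rows,", nrow(test), "test rows\n")
```

Because the split is stratified on the label, each class keeps roughly the same proportion in both sets.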

The caret library is not available in Azure ML Studio, but that's no problem - as the training is done on-premises, we don't need to split train and test data in Azure ML Studio.

Prepare labels for xgboost

xgboost expects labels and features in separate sets, so we should split X from Y, and remove the labels from X:
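In base R this could look as follows (a sketch on toy data; the label column name combined_label is an assumption):

```r
# Toy training set; column names are hypothetical.
train <- data.frame(
  age            = c(23, 31, 19),
  urban.yes      = c(1, 0, 1),
  combined_label = c("111", "221", "122")
)
label_col <- "combined_label"

# Y: the labels only.
train_y <- train[[label_col]]

# X: every column except the label, as a numeric matrix.
train_x <- as.matrix(train[, names(train) != label_col, drop = FALSE])

print(dim(train_x))
```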

xgboost also expects the labels to be a zero-based numeric. It's not too hard to achieve that, but keep in mind you should be able to translate the predicted values to the corresponding classes!
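A sketch of one way to do this with a factor: the factor levels double as the lookup table for translating predictions back. The label values are made up.

```r
labels <- c("111", "221", "122", "111")

# A factor stores each distinct label as an integer code starting at 1.
label_factor <- factor(labels)
class_names  <- levels(label_factor)   # lookup table: position -> class

# Subtract 1 to get the zero-based numbers xgboost expects.
y <- as.integer(label_factor) - 1

# Translating predicted numbers back to the original classes:
predicted <- c(2, 0)
predicted_classes <- class_names[predicted + 1]
print(predicted_classes)
```

Keep class_names around (for example by saving it next to the model), otherwise the predicted numbers can't be mapped back.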

 

Train the xgboost model

After having prepared the dataset, we can now train xgboost. It's pretty easy:
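A minimal training sketch on synthetic data. It uses xgb.DMatrix and xgb.train, which may differ from the exact call used for the competition; all parameter values are illustrative assumptions, and max_depth is the underscore spelling of max.depth.

```r
library(xgboost)

set.seed(1)

# Synthetic data: 3 numeric features, 3 classes derived from the first feature.
x <- matrix(rnorm(300), ncol = 3)
y <- as.integer(cut(x[, 1], breaks = c(-Inf, -0.5, 0.5, Inf))) - 1  # 0, 1, 2

# Features and zero-based labels go into one xgb.DMatrix.
dtrain <- xgb.DMatrix(data = x, label = y)

params <- list(
  nthread   = 2,                 # number of threads
  max_depth = 4,                 # maximum depth of each tree
  eta       = 0.3,               # learning rate
  objective = "multi:softmax",   # multiclass, predicts the class number
  num_class = 3                  # number of distinct classes
)

# nrounds is the number of boosting rounds.
model <- xgb.train(params = params, data = dtrain, nrounds = 10)
```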

Note that the parameters of xgboost used here fall in three categories:

  • General parameters
    • nthread (number of threads used, here 8 = the number of cores in my laptop)
  • Booster parameters
    • max.depth (of tree)
    • eta
  • Learning task parameters
    • objective: type of learning task (softmax for multiclass classification)
    • num_class: needed for the "softmax" algorithm: how many classes to predict?
  • Command Line Parameters
    • nround: number of rounds for boosting (the nrounds argument in the R package)

For a complete overview of parameters see https://github.com/dmlc/xgboost/blob/master/doc/parameter.md.

Predictions using the trained xgboost model

To predict new cases using the just-trained xgboost model, use the function predict. In the example below, I've included three lines to test the performance too:
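A self-contained sketch on synthetic data: it trains a small model, predicts on held-out rows, and computes plain accuracy. The data, the parameter values, and the choice of accuracy as metric are assumptions, not the competition setup.

```r
library(xgboost)

set.seed(1)

# Synthetic data: 200 rows, 3 features, 3 classes derived from the first feature.
x <- matrix(rnorm(600), ncol = 3)
y <- as.integer(cut(x[, 1], breaks = c(-Inf, -0.5, 0.5, Inf))) - 1
train_rows <- 1:150

model <- xgb.train(
  params = list(objective = "multi:softmax", num_class = 3,
                max_depth = 3, eta = 0.3, nthread = 2),
  data = xgb.DMatrix(x[train_rows, ], label = y[train_rows]),
  nrounds = 10
)

# With multi:softmax, predict returns the zero-based class number per row.
pred <- predict(model, xgb.DMatrix(x[-train_rows, ]))

# Performance check: fraction of held-out rows predicted correctly.
accuracy <- mean(pred == y[-train_rows])
cat("Accuracy on held-out rows:", accuracy, "\n")
```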

Conclusion

As you see, it's not too hard to use the winning xgboost algorithm inside R. All code presented above can be executed in order, and will result in a working predictive model for the ongoing Women's Health Risk Assessment (WHRA) challenge! Next week (just before the deadline) I'll show you how to import the model inside Azure ML Studio, but if you want to do it earlier, I'm pretty sure you can figure out how to do it yourself (or just ask me to share it with you).

 

