Azure ML Thursday 4: ML in Python

In this fourth installment of the Azure ML Thursday series we move our ML solution out of Azure ML and take our first steps in Python with scikit-learn. Today we look at using "just" Python for ML; next week we bring the trained models to Azure ML. You'll notice there's a lot more to tweak and improve once you do your Machine Learning here! ML in Python is quite a large topic, so many subjects will only be touched on lightly. Nonetheless, I try to give just enough samples and basics to get your first ML models running there!

Python, Anaconda, Jupyter and Azure ML Studio

Python is often used in conjunction with the scikit-learn collection of libraries. The most important libraries used for ML in Python are grouped inside a distribution called Anaconda. This is the distribution that's also used inside Azure ML [1]. Besides Python and scikit-learn, Anaconda contains all kinds of Data Science-oriented packages. It's a good idea to install Anaconda as a distribution and use Jupyter (formerly IPython) as your development environment: Anaconda gives you almost the same environment on your local machine as your code will run in once it is in Azure ML. Jupyter gives you a nice way to keep code (in Python) and documentation (in Markdown) together.

Anaconda can be downloaded from https://www.continuum.io/downloads.

Jupyter - cloud or local?

Jupyter can be run from within Azure ML Studio too (currently in preview, it's called "Azure Notebooks"). The datasets that are available within the experiments can easily be used in a notebook, and Jupyter plays nicely together with Azure ML Experiments overall. Personally, I still favor a local Jupyter installation for two reasons:

  • On Azure Notebooks, if you're idle for more than one hour the notebook server will be reclaimed & recycled (so no use in starting a grid CV running for multiple hours)
  • I couldn't find a way to locate my pickled models at the Azure

I do still use Azure Notebooks, especially to double-test my locally developed model in the cloud (had some strange issues there). Jupyter is included with Anaconda.

The actual model development

When you've taken your first steps inside Azure ML and cross over to Python/sklearn to perform Machine Learning, there are a few "new" things to learn:

  • X and Y
  • NA-values
  • Non-existence of multiclass classifiers
  • Splitting column types for preprocessing
  • Feature stacking

X and Y

Inside Azure ML Studio the terms X and Y are not used - but I think they're the most common terms in supervised Machine Learning:

  • X: the features used as input for the predictive model. In the Iris Flower example: sepal_length, sepal_width, petal_width, petal_length
  • Y: the label (target) to be predicted. In the Iris Flower example: class

Xtrain and Ytrain are X and Y, but limited to the rows of the training set.
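To make the naming concrete, here is a minimal sketch using the Iris data that ships with scikit-learn. The 70/30 split ratio and the `random_state` value are assumptions chosen for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X = iris.data    # features: sepal length/width, petal length/width
Y = iris.target  # label: the iris class, encoded as 0, 1 or 2

# Hold out 30% of the rows as a test set; the rest is the training set.
Xtrain, Xtest, Ytrain, Ytest = train_test_split(
    X, Y, test_size=0.3, random_state=42
)

print(X.shape, Y.shape)            # all 150 rows
print(Xtrain.shape, Ytrain.shape)  # only the training rows
```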

NA-values

Scikit-learn algorithms cannot handle blank values (here encoded as a NaN). In Azure ML experiments, you usually clean blanks using "Clean Missing Data".

In scikit-learn, blanks are filled easily using an Imputer:
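A minimal sketch of median imputation on invented toy data. One assumption to be aware of: older scikit-learn releases exposed this as `Imputer` in `sklearn.preprocessing`, while recent releases provide `SimpleImputer` in `sklearn.impute`, which is what the sketch uses.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy numeric data (invented for illustration); np.nan marks a blank.
Xtrain = np.array([[1.0, 10.0],
                   [3.0, np.nan],
                   [np.nan, 30.0],
                   [5.0, 50.0]])
Xtest = np.array([[np.nan, np.nan]])

imputer = SimpleImputer(strategy="median")

# fit_transform learns the per-column medians from the training set...
Xtrain_filled = imputer.fit_transform(Xtrain)
# ...and transform reuses those same training medians on new data.
Xtest_filled = imputer.transform(Xtest)

print(imputer.statistics_)  # medians learned from Xtrain: 3.0 and 30.0
print(Xtest_filled)
```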

Two things to remember here:

  1. Imputer cannot handle textual columns - so in order to impute the most frequent values on textual (categorical) columns, you need to convert them to numbers first [2]
  2. Imputer can use the median, mean or the most frequent value to fill the blanks. The median used here is the median of the training set. When processing the test set or predicting real-world values, remember to use the already trained Imputer too! Every transformation for training should be repeated for prediction.

Non-existence of multiclass classifiers (and how to work around that)

Azure ML Studio provides "multiclass models", which can be trained with one label column containing multiple classes. For example: one column "religion", four possible values: "Roman Catholic", "Muslim", "Eastern Orthodox" and "Jew". However, behind the scenes classification models often work a little differently: any classification column can be only one or zero, but you can predict multiple columns - all of them being binary. Notice this enables you to transform the model: column "religion" can be transformed from one four-value column to four binary columns "isRomanCatholic", "isMuslim", "isEasternOrthodox" and "isJew". For every row, one of these columns contains a one, the rest contain zeroes.

The process of translating the single multiclass column to multiple two-class columns is called One-Hot Encoding:
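A minimal sketch of that translation with scikit-learn's `OneHotEncoder`, using the four values from the example. The toy rows are invented, and `handle_unknown="ignore"` is an assumption: it encodes values not seen during fitting as all zeroes instead of raising an error.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One categorical column, one value per row (toy rows for illustration).
religion_train = np.array([["Roman Catholic"], ["Muslim"],
                           ["Eastern Orthodox"], ["Jew"], ["Muslim"]])

encoder = OneHotEncoder(handle_unknown="ignore")
# The encoder returns a sparse matrix by default; toarray() makes it dense.
onehot_train = encoder.fit_transform(religion_train).toarray()

print(encoder.categories_[0])  # the four classes, sorted alphabetically
print(onehot_train)            # one row per input, exactly one 1 per row

# For prediction, reuse the fitted encoder instead of fitting a new one.
onehot_new = encoder.transform(np.array([["Jew"]])).toarray()
print(onehot_new)
```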

Again, remember that exactly this process needs to be repeated for all predictions too!

Splitting column types for preprocessing

One-Hot encoding only needs to be done on multiclass columns that must be translated to multiple binary-class columns. In other words, we don't want to include numeric columns like age. In order to preprocess the different column types separately, we first split the columns. I use a helper function to do this [3]:

Using the helper function I define which column headers are used to split X into XCategorical, XNumeric and XBinary [4]:
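One possible shape for such a helper, as a sketch only: the function name `split_columns`, its signature and the toy column names are all hypothetical, chosen to match the XCategorical / XNumeric / XBinary naming.

```python
import pandas as pd


def split_columns(df, categorical, numeric, binary):
    """Return three DataFrames, one per column type (hypothetical helper)."""
    return df[categorical], df[numeric], df[binary]


# Toy data with invented column names.
X = pd.DataFrame({
    "religion": ["Muslim", "Jew", "Muslim"],
    "age": [23, 41, 35],
    "urban": [1, 0, 1],
})

# The column headers decide which block each column ends up in.
XCategorical, XNumeric, XBinary = split_columns(
    X, categorical=["religion"], numeric=["age"], binary=["urban"]
)

print(XCategorical.columns.tolist())
print(XNumeric.columns.tolist())
print(XBinary.columns.tolist())
```

Passing lists of headers keeps each result a DataFrame, even when a block has only one column, so the later preprocessing steps always receive two-dimensional input.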

Feature stacking

After preprocessing, we need to paste the preprocessed columns together again. This is called feature stacking, and numpy provides for it:
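A minimal sketch with `numpy.hstack`, which pastes arrays together column-wise. The three small arrays are invented stand-ins for the preprocessed blocks.

```python
import numpy as np

# Stand-ins for preprocessed blocks; all must have the same number of rows.
XCategorical = np.array([[0, 1], [1, 0], [0, 1]])  # e.g. one-hot encoded
XNumeric = np.array([[23.0], [41.0], [35.0]])      # e.g. imputed ages
XBinary = np.array([[1], [0], [1]])

# Stack the blocks side by side into one feature matrix.
Xstacked = np.hstack([XCategorical, XNumeric, XBinary])
print(Xstacked.shape)  # 3 rows, 2 + 1 + 1 = 4 columns
```

If one of the blocks is a SciPy sparse matrix (the default output of `OneHotEncoder`), convert it with `.toarray()` first or use `scipy.sparse.hstack` instead.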

Training the model

The actual model training is quite easy - here is all the code you need:
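A minimal sketch of fit and predict. The choice of a random forest and the bundled Iris data are assumptions for illustration; any scikit-learn classifier follows the same two calls.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
Xtrain, Ytrain = iris.data, iris.target

# Train the model on the features and their labels.
model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(Xtrain, Ytrain)

# Predict the class for a few rows (here simply the first five).
predictions = model.predict(Xtrain[:5])
print(predictions)
```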

Of course, stratified splits, grid searches and cross-validations are available too. I won't spell out every detail, but just to show how easy it is:

Stratified split
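A minimal sketch, assuming the bundled Iris data and an 80/20 split: passing the labels to the `stratify` parameter of `train_test_split` keeps the class proportions the same in both parts.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()

# stratify=... preserves the class distribution in train and test.
Xtrain, Xtest, Ytrain, Ytest = train_test_split(
    iris.data, iris.target,
    test_size=0.2, stratify=iris.target, random_state=42
)

# Iris has 50 rows per class, so each part gets an equal share per class.
print(np.bincount(Ytrain))
print(np.bincount(Ytest))
```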

Gridsearch

As you see, this is quite an extensive search (it takes some hours) - not recommended, but it shows the workings! Notice that this grid search does a cross-validation by default (here with 5 folds).
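A much smaller grid is enough to see the mechanics. The sketch below assumes a random forest on the bundled Iris data; the parameter grid is invented for illustration and finishes in seconds rather than hours.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

iris = load_iris()

# Every combination of these values is tried: 2 x 2 = 4 candidates.
param_grid = {
    "n_estimators": [10, 50],
    "max_depth": [2, None],
}

search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=5,  # 5-fold cross-validation per candidate
)
search.fit(iris.data, iris.target)

print(search.best_params_)  # the winning combination
print(search.best_score_)   # its mean cross-validated accuracy
```

After fitting, `search.best_estimator_` holds a model retrained on all the data with the best parameters, so it can be used for prediction directly.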

Cross-validation

Imagine GridSearchCV, but with all parameters fixed...
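A minimal sketch with `cross_val_score`, again assuming a random forest on the bundled Iris data. The model parameters are fixed up front and only the folds vary.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

iris = load_iris()
model = RandomForestClassifier(n_estimators=50, random_state=42)

# Train and score the same model on 5 different train/validation folds.
scores = cross_val_score(model, iris.data, iris.target, cv=5)

print(scores)                       # one accuracy value per fold
print(scores.mean(), scores.std())  # summary over the folds
```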

Conclusion

The whole zoo of Python, Anaconda and pandas may seem daunting at first, but it's not very hard to make the move from Azure ML to Machine Learning in Python. I spent a lot of time figuring out how to mold the dataframes for the different operations - in shape (as_matrix, unstack, etc.) as well as in conversions. Next week we'll look at how to include trained scikit-learn models in Azure ML. To close this week's Azure ML Thursday I'll finish with one last code listing for a basic model on the Women's Health dataset [5]. Disclaimer: it's deliberately kept very basic: no custom imputer per datatype, not a very good predicting model... It's up to you to improve it by trying other models, tweaking parameters and so on. Feel free to post your improvements in the comments below!

