Azure ML Thursday 5: trained Python models

Last week, we stepped out of Azure ML to look at building ML models in Python using scikit-learn. Today, we focus on getting the trained model back into Azure ML - the place where my ML solutions live in a managed, enterprise environment.

The path of bringing a trained model from the local Python/Anaconda environment to Azure ML in the cloud is roughly as follows:

  1. Export the trained model
  2. Zip the exported files
  3. Upload to the Azure ML environment
  4. Embed in your Azure ML solution

Sounds simple, and indeed it isn't too hard. The things getting in the way of "just" doing it are primarily a lack of Python / scikit-learn knowledge ("how do you export a trained model in the first place?") and a general lack of ML experience (remember that you need to perform all transformations you did on the training data in exactly the same way in production!). As soon as you've learned how to tackle the first hurdle and seen the trick of importing models inside Azure ML Studio, hardly anything is holding you back from deploying your locally developed masterpieces to production.

Step 1: Export the trained model

Remember that your trained model in Python is stored in "just" another variable - just as you're used to in (almost) any object-oriented language. Python can export the contents of any variable using a process called pickling. When you pickle an object, the bytes currently in memory representing the object are dumped to (and can later be loaded again from) a file.

It's actually quite easy:
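A minimal sketch (the model below is just a stand-in for whatever variable holds your trained estimator, and clf.pkl is an example filename):

    import pickle
    from sklearn.tree import DecisionTreeClassifier

    # stand-in for the model your own training script produces
    clf = DecisionTreeClassifier().fit([[0], [1]], [0, 1])

    # dump the object to disk ('wb' = write binary)
    with open('clf.pkl', 'wb') as f:
        pickle.dump(clf, f)

    # ... and load it again later ('rb' = read binary)
    with open('clf.pkl', 'rb') as f:
        clf = pickle.load(f)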

For scikit-learn, it's recommended you use joblib as a replacement for pickle. It's not strictly necessary (you can also use plain pickle), but joblib is more efficient for objects carrying large NumPy arrays. Plus, it's even easier to write: you don't have to worry about file-opening modes like the "wb" and "rb" above.

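A sketch of the joblib equivalent - again with clf.pkl as an example filename, using the joblib copy bundled with scikit-learn 0.17 (newer setups use the standalone joblib package instead):

    from sklearn.externals import joblib
    from sklearn.tree import DecisionTreeClassifier

    # stand-in for your trained model
    clf = DecisionTreeClassifier().fit([[0], [1]], [0, 1])

    joblib.dump(clf, 'clf.pkl')    # no file handles, no 'wb'
    clf = joblib.load('clf.pkl')   # no 'rb' either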

For large objects, joblib often saves the contents in multiple files, whose filenames will be appended with _(counter).npy.


You must keep all files representing a single object together in one folder when loading it, but you don't have to interact with any of the '.npy' files: you only interact with the file you saved explicitly.

 

Step 2: Zip the exported files

In order to use pickled objects inside Azure ML's Execute Python Script module, we need to zip everything and upload it as a dataset. Inside the zip file, all pickled objects should be in the root.


Besides pickled objects, you can include Python scripts in the zip file too. For example, you could add a Python script that unpickles the objects you need for a particular ML model, so you don't have to remember the syntax and exact paths where Azure ML stores the contents. These scripts can easily be consumed by the Execute Python Script module, as I'll show in step 4.
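As an illustration, here's a sketch that builds such a zip with the standard-library zipfile module. The filenames (clf.pkl, its _(counter).npy companions, helper.py) are just examples from the steps above:

    import glob
    import zipfile

    # the pickled model plus any .npy companion files joblib wrote,
    # plus an optional helper script
    files_to_ship = glob.glob('clf.pkl*') + glob.glob('helper.py')

    with zipfile.ZipFile('script_bundle.zip', 'w') as bundle:
        for path in files_to_ship:
            # arcname without any folder keeps everything in the root of the zip
            bundle.write(path, arcname=path)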

 

Step 3: Upload to the Azure ML environment

Azure ML has no way to upload "just" libraries - all files are treated equally. The zip file should be uploaded as a dataset:

(Screenshot: uploading the zip file as a new dataset in Azure ML Studio)

Step 4: Embed in Azure ML experiment

With the zipped file available as a "dataset", we can embed it inside an experiment. Inside your experiment, the zip we just uploaded is available under My Datasets. In order to use it, throw it onto the canvas and connect it to the right (as opposed to center or left) input port of an Execute Python Script module:

(Screenshot: the uploaded zip dataset connected to the right-most input port of an Execute Python Script module)

When running the experiment, Azure ML Studio extracts the files inside the zip dataset to the folder "Script Bundle". From within Python you can access the files via that relative path:
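A minimal sketch of such an Execute Python Script body (azureml_main is the entry point Azure ML calls; clf.pkl is an example filename from the zip we uploaded):

    from sklearn.externals import joblib

    def azureml_main(dataframe1=None, dataframe2=None):
        # the contents of the zip dataset are extracted to '.\Script Bundle'
        trained_model = joblib.load('./Script Bundle/clf.pkl')

        scored = dataframe1.copy()
        scored['prediction'] = trained_model.predict(dataframe1)
        return scored,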

As described under step 2, you can also include helper scripts. To use a helper script, you don't have to memorize the path: you can just import it with Python's import statement:
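For instance, a helper.py packed in the root of the zip (a hypothetical script of your own) could hide the path entirely:

    # helper.py - lives in the root of the zip, next to the pickled files
    from sklearn.externals import joblib

    def load_model():
        return joblib.load('./Script Bundle/clf.pkl')

Inside the Execute Python Script module, the Script Bundle folder is on Python's search path, so the script body shrinks to:

    import helper   # loads helper.py from the Script Bundle

    def azureml_main(dataframe1=None, dataframe2=None):
        scored = dataframe1.copy()
        scored['prediction'] = helper.load_model().predict(dataframe1)
        return scored,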

Through the use of a helper script, the amount of code inside the Execute Python Script module is kept to a minimum - which makes your datasets more portable and easier to maintain.

One More Thing: Including all transformations

In order to repeat all transformations you performed on the training set in production, it's important to export not only the trained ML model: all fitted transformations need to be exported too. With last week's sample code, there are four objects to export (the Imputer, the Religion-mapper, the one-hot encoder and the actual trained model).

If you only use transformers from within scikit-learn, you can make your life a lot easier by using pipelines. Check out the pipeline documentation in the scikit-learn docs, as well as an example of a pipeline constructed from one Imputer and one RandomForestRegressor, if you want to know how. Don't worry - you'll find it's pretty easy :).
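A sketch of that idea - assuming scikit-learn 0.17-era APIs, where Imputer still lives in sklearn.preprocessing - so that one Pipeline object captures both the imputation and the model, and a single joblib.dump exports everything at once:

    import numpy as np
    from sklearn.pipeline import Pipeline
    from sklearn.preprocessing import Imputer           # sklearn.impute.SimpleImputer in newer versions
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.externals import joblib

    # tiny dummy training set with missing values, just to make the sketch runnable
    X_train = np.array([[1.0, 2.0], [np.nan, 3.0], [4.0, np.nan], [5.0, 6.0]])
    y_train = np.array([1.0, 2.0, 3.0, 4.0])

    pipeline = Pipeline([
        ('impute', Imputer(strategy='median')),
        ('forest', RandomForestRegressor(n_estimators=100)),
    ])
    pipeline.fit(X_train, y_train)

    joblib.dump(pipeline, 'pipeline.pkl')   # one object to zip and upload instead of four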

Conclusion

Last week, I showed you a brief summary of using Python with scikit-learn to train your ML models, which vastly expands the range of ML techniques you can apply. Today was the follow-up: how to use your trained Python ML models within Azure ML again.

With today's knowledge, it's perfectly doable to participate in an Azure ML competition using your enhanced Python ML models, or, with the help of your personal data scientist, to port their ingenious models to the managed Azure ML environment.



3 Comments

  1. Can you tell me what version of Scikit-Learn you used? AzureML currently runs version 15.1 which makes it difficult to use serialised models from later versions of Scikit-Learn on AzureML.

    • What problems do you run into?

If I remember correctly, I used the version that came with Anaconda 4.1.1 (64-bit) - which is version 0.17.1. However, I might have used Anaconda 4.0 in order to remain compatible (Anaconda 4.0 is the one used on Azure).

  2. ameen

How do I solve this problem?
    Error 0085: The following error occurred during script evaluation, please view the output log for more information:
    ---------- Start of error message from Python interpreter ----------
    Caught exception while executing function: Traceback (most recent call last):
    File "C:\pyhome\lib\pickle.py", line 268, in _getattribute
    obj = getattr(obj, subpath)
    AttributeError: module 'sklearn.externals.joblib.numpy_pickle' has no attribute 'NumpyArrayWrapper'
    During handling of the above exception, another exception occurred:
    Traceback (most recent call last):
    File "C:\server\invokepy.py", line 199, in batch
    odfs = mod.azureml_main(*idfs)
    File "C:\temp\23b0a8ea22e745bdbc79961d5cf1d10a.py", line 22, in azureml_main
    import helper #loads my_helper_script.py
    File ".\Script Bundle\helper.py", line 10, in
    trained_model = joblib.load('.\\Script Bundle\\clfGBR.p')
    File "C:\pyhome\lib\site-packages\sklearn\externals\joblib\numpy_pickle.py", line 459, in load
    obj = unpickler.load()
    File "C:\pyhome\lib\pickle.py", line 1039, in load
    dispatch[key[0]](self)
    File "C:\pyhome\lib\pickle.py", line 1343, in load_stack_global
    self.append(self.find_class(module, name))
    File "C:\pyhome\lib\pickle.py", line 1386, in find_class
    return _getattribute(sys.modules[module], name)[0]
    File "C:\pyhome\lib\pickle.py", line 271, in _getattribute
    .format(name, obj))
    AttributeError: Can't get attribute 'NumpyArrayWrapper' on
    Process returned with non-zero exit code 1

    ---------- End of error message from Python interpreter ----------
    Start time: UTC 10/31/2017 12:54:11
    End time: UTC 10/31/2017 12:54:28
