- Lab
- A Cloud Guru
Creating Custom Modules in Azure Machine Learning Designer
While Azure Machine Learning Designer provides a wealth of transforms, not every need can be met by the predefined modules. For more specific or custom needs, you can write your own data manipulations directly in either Python or R. This gives you the power to use any standard features of those languages, such as conditionals and looping, as well as many libraries, including your own modules, to prepare, clean, and wrangle the data to your exact needs. In this lab, we will write a custom data transformation module in Python.
Path Info
Table of Contents
-
Challenge
Set Up the Workspace
-
Log in and go to the Azure Machine Learning Studio workspace provided in the lab.
-
Create a training cluster of
D2
instances. -
Create a new blank pipeline in the Azure Machine Learning Studio Designer.
-
-
Challenge
Explore the Data
-
Add an IMDB Movie Titles dataset node to the canvas. Visualize the data and try to find a pattern for the year at the end of the movie name.
-
Use the Year at end of title query in an Apply SQL Transformation node to show outliers to the basic pattern.
-
Submit the pipeline to apply the transform. Once finished, inspect the output of Apply SQL Transformation.
-
Though the data is mostly uniform, there are some values that are different. Evaluate the results to find the edge cases that will need to be captured.
-
-
Challenge
Extract the Year
-
Add an Execute Python Script transform node to the canvas.
-
Define a better regular expression pattern than what we used in the Year at end of title query above.
-
Create a function to extract the year from the title based on the regular expression.
-
Define the entry point to the code. Add "Year" as a new column in the data frame. If you have issues, refer to the Run Python code in Azure Machine Learning documentation.
-
Add an Apply SQL Transformation node to help evaluate the results. Use the Null year query.
-
Submit the pipeline, then inspect the results from the Apply SQL Transformation node. Did the regular expression capture all of the values, or will we have to do further processing later?
-
-
Challenge
Clean Up the Movie Titles
-
Add another Execute Python Script transform node to the canvas.
-
Define a regular expression pattern to strip the year and anything after it from the movie name.
-
Create a function to parse the movie name using the regular expression you just created. This function should not lose any movie names if they don't match the pattern.
-
Define the entry point to the code. Replace the current movie name with the cleaned movie name.
-
Add an Apply SQL Transformation transform node to evaluate the results. Use the Year at end of title query to show any results that were not cleaned up as expected.
-
Submit the pipeline, then inspect the results from the Apply SQL Transformation node. Did your regular expression clean up the names properly? Did you lose any data in the process? For reference, there are 16,059 rows in the original dataset.
-
-
Challenge
Combine the Custom Modules
-
With our code written and tested, combine the code into one module to make the pipeline more efficient.
-
Define regular expression patterns for extracting the year and cleaning up the movie name.
-
Create functions for applying those regular expressions to the data.
-
Define your entry point function and add calls to both of your data cleaning functions. The order in which they are called will matter.
-
Add an Apply SQL Transformation node to help evaluate the results. Use the Null year query to show any rows that did not parse the year correctly.
-
Submit the pipeline, then inspect the results from the Execute Python Script node first. Did the movie names and years come out like you expected? Did you lose any rows of data?
-
Inspect the results from the Apply SQL Transformation node. This will show you data that still needs further processing.
-
What's a lab?
Hands-on Labs are real environments created by industry experts to help you learn. These environments help you gain knowledge and experience, practice without compromising your system, test without risk, destroy without fear, and let you learn from your mistakes. Hands-on Labs: practice your skills before delivering in the real world.
Provided environment for hands-on practice
We will provide the credentials and environment necessary for you to practice right within your browser.
Guided walkthrough
Follow along with the author’s guided walkthrough and build something new in your provided environment!
Did you know?
On average, you retain 75% more of your learning if you get time for practice.