Building Machine Learning Models in Spark 2
Training ML models is compute-intensive and is often best done in a distributed environment. This course will teach you how to use Spark to explore, clean, and aggregate data and to train ML models, all on one platform.
What you'll learn
Spark is possibly the most popular engine for big data processing these days. In this course, Building Machine Learning Models in Spark 2, you will learn to build and train Machine Learning (ML) models such as regression, classification, clustering, and recommendation systems on Spark 2.x's distributed processing environment.
This course starts off with an introduction to the two ML libraries available in Spark 2: the older spark.mllib library, built on top of RDDs, and the newer spark.ml library, built on top of DataFrames. You will see the two compared so that you know when to pick one over the other.
You will see a classification model built with decision trees using the older spark.mllib library, and then see how to implement the same model with the newer spark.ml library.
The course covers many features of Spark 2, including ML pipelines, which chain your data transformations and ML operations into a single workflow.
By the end of this course, you will be comfortable using the advanced features that Spark 2 offers for machine learning. You'll learn to use components such as Transformers, Estimators, and Parameters within your ML pipelines to run distributed training at scale.
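To make the Transformer and Estimator vocabulary concrete, here is a minimal plain-Python sketch of the contract that ML pipelines are built around: a Transformer maps a dataset to a new dataset, an Estimator is fit on a dataset and returns a fitted Transformer, and a pipeline fits its stages in order. This is an illustration only. It does not import or run Spark, and the class names (`AddRatio`, `StandardScaler`, `Pipeline`, and so on) along with the columns `a`, `b`, and `ratio` are toy stand-ins chosen for this example, not the course's code or the pyspark classes.

```python
from statistics import mean, pstdev

# Toy classes for illustration only; these are NOT the pyspark.ml classes.


class Transformer:
    """Maps a dataset (here, a list of dicts) to a new dataset."""

    def transform(self, rows):
        raise NotImplementedError


class Estimator:
    """Learns from a dataset and returns a fitted Transformer."""

    def fit(self, rows):
        raise NotImplementedError


class AddRatio(Transformer):
    """Stateless feature transformer: adds a derived 'ratio' column."""

    def transform(self, rows):
        return [{**r, "ratio": r["a"] / r["b"]} for r in rows]


class StandardScalerModel(Transformer):
    """Fitted result of StandardScaler; holds the learned mean and std."""

    def __init__(self, column, mu, sigma):
        self.column, self.mu, self.sigma = column, mu, sigma

    def transform(self, rows):
        out = self.column + "_scaled"
        return [{**r, out: (r[self.column] - self.mu) / self.sigma} for r in rows]


class StandardScaler(Estimator):
    """Estimator: learns the mean and population std of one column."""

    def __init__(self, column):
        self.column = column

    def fit(self, rows):
        values = [r[self.column] for r in rows]
        return StandardScalerModel(self.column, mean(values), pstdev(values))


class PipelineModel(Transformer):
    """A fitted pipeline: applies every fitted stage in order."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline(Estimator):
    """Fits each Estimator stage on the data produced by the stages before it."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if isinstance(stage, Estimator):
                stage = stage.fit(rows)   # Estimator -> fitted Transformer
            rows = stage.transform(rows)  # feed the next stage
            fitted.append(stage)
        return PipelineModel(fitted)


train = [{"a": 2.0, "b": 1.0}, {"a": 4.0, "b": 1.0}, {"a": 6.0, "b": 1.0}]
model = Pipeline([AddRatio(), StandardScaler("ratio")]).fit(train)
scored = model.transform([{"a": 4.0, "b": 1.0}])
print(scored)
```

The design point is that fitting a pipeline yields one reusable object, so the same transformations learned on the training data are replayed unchanged at prediction time.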
Table of contents
- Version Check 0m
- Module Overview 2m
- Prerequisites and Course Overview 3m
- RDDs: The Building Blocks of Spark 4m
- DataFrames in Spark 2 2m
- Demo: Spark 2 Installation and Working with Jupyter Notebooks 4m
- spark.mllib vs. spark.ml 5m
- Introducing Decision Trees 5m
- Gini Impurity and Pros and Cons of Decision Trees 6m
- Demo: Basic Project Setup 3m
- Demo: Wine Classification Using Decision Trees in spark.mllib 8m
- Demo: Working with the LIBSVM Data Format 2m
- Demo: Decision Trees Using the LIBSVM Data Format 5m
- Module Overview 1m
- ML Pipelines, Estimators, and Transformers 7m
- Training and Prediction Pipeline Stages 3m
- Feature Engineering 2m
- Feature Extractors 4m
- Feature Transformers 4m
- Feature Selectors and Locality Sensitive Hashing 1m
- The Confusion Matrix: Accuracy, Precision, Recall, F1 Score 6m
- Demo: Wine Classification Using Decision Trees in Spark ML 3m
- Demo: Converting Categorical Data to Numeric Values 2m
- Demo: The Decision Tree Classifier 2m
- Random Forests 4m
- Demo: Income Classification Using Random Forests 4m
- Demo: Using ML Pipelines 6m
- Demo: Predictions Using the Random Forest 2m
- Introducing Regularized Regression Models to Prevent Overfitting 5m
- Lasso and Ridge Regression 3m
- Demo: Linear Regression with the Elastic Net Param 4m
- Demo: Predictions Using the Regression Model 3m
- Demo: Hyperparameter Tuning 4m
- Module Overview 1m
- Supervised and Unsupervised Learning Techniques 5m
- Clustering Objectives 3m
- Visualizing K-means Clustering 2m
- Number of Clusters as a Hyperparameter: The Elbow and Silhouette Method 8m
- Demo: K-means Clustering on the Titanic Dataset 6m
- Demo: Exploring Clusters 5m
- Principal Component Analysis: Intuition 4m
- Demo: Regression Model Without PCA 6m
- Demo: Performing Regression on Principal Components 6m
- Module Overview 1m
- Content-based and Collaborative Filtering 5m
- Estimating the Ratings Matrix 8m
- The Alternating Least Squares Method 2m
- Explicit and Implicit Ratings 6m
- Cold Start Strategies and Compute Intensity 2m
- Demo: Building a Recommendation System Using Explicit Ratings 4m
- Demo: Getting Movie Recommendations for Specific Users 4m
- Demo: Building a Recommendation System Using Implicit Ratings 4m
- Demo: Getting Artist Recommendations for Specific Users 3m
- Summary and Further Study 2m