Predictive Analytics Using Apache Spark MLlib on Databricks
This course teaches you to understand and implement core predictive analytics techniques, such as regression and classification, using the Apache Spark MLlib APIs on Databricks.
What you'll learn
The Spark unified analytics engine is one of the most popular frameworks for big data analytics and processing. Spark offers comprehensive, easy-to-use APIs for machine learning, which you can use to build predictive models for regression and classification and to pre-process the data fed into these models.
In this course, Predictive Analytics Using Apache Spark MLlib on Databricks, you will learn to implement machine learning models using Spark ML APIs. First, you will understand the different Spark libraries available for machine learning: the older RDD-based library and the newer DataFrame-based library. You will then explore the range of transformers available in Spark for pre-processing data for machine learning, such as scaling and standardization transformers for numeric data and label encoding and one-hot encoding transformers for categorical data.
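To make the pre-processing steps concrete, here is a minimal sketch of what standardization, label encoding, and one-hot encoding compute. It is written in plain Python so it runs without a cluster and is not the course's code. The function names are hypothetical, and the alphabetical ordering of labels is an assumption made for illustration. In Spark ML, the corresponding transformers live in `pyspark.ml.feature` (for example `StandardScaler`, `StringIndexer`, and `OneHotEncoder`), and their defaults may differ from this sketch.

```python
from statistics import mean, pstdev


def standardize(values):
    """Rescale numeric values to zero mean and unit (population) standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        # A constant column carries no spread; map every value to 0.0.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


def label_encode(categories):
    """Map each distinct category to an integer index (alphabetical order, by assumption)."""
    index = {c: i for i, c in enumerate(sorted(set(categories)))}
    return [index[c] for c in categories], index


def one_hot(indices, size):
    """Turn each integer index into a 0/1 vector with one column per category."""
    return [[1 if i == j else 0 for j in range(size)] for i in indices]


scaled = standardize([2.0, 4.0, 6.0])
encoded, mapping = label_encode(["red", "green", "red", "blue"])
vectors = one_hot(encoded, len(mapping))
```

Here `encoded` holds one integer per input row, and `vectors` holds the matching one-hot rows, with every category kept as its own column.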
Next, you will use linear regression and ensemble models such as random forest and gradient boosted trees to build regression models. You will use these models for prediction on batch data. You will also see how to use Spark ML Pipelines to chain together transformers and estimators into a complete machine learning workflow.
Finally, you will implement classification models using logistic regression as well as decision trees. You will train the ML model using batch data but perform predictions on streaming data. You will also use hyperparameter tuning and cross-validation to find the best model for your data.
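The idea behind hyperparameter tuning with cross-validation can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the course's code: the model is a one-feature ridge regression without an intercept (chosen only because it has a closed-form fit), the fold assignment is a simple round-robin, and all function names are hypothetical. In Spark ML, the equivalent workflow is typically built with `CrossValidator` and `ParamGridBuilder` from `pyspark.ml.tuning`.

```python
from statistics import mean


def k_fold_indices(n, k):
    """Split row indices 0..n-1 into k folds by round-robin assignment."""
    return [list(range(i, n, k)) for i in range(k)]


def fit_ridge(xs, ys, lam):
    """Closed-form weight for 1-D ridge regression without an intercept."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)


def mse(w, xs, ys):
    """Mean squared error of the predictions w * x."""
    return mean((w * x - y) ** 2 for x, y in zip(xs, ys))


def cross_validate(xs, ys, lam, k):
    """Average held-out error across k folds for one hyperparameter value."""
    scores = []
    for fold in k_fold_indices(len(xs), k):
        held_out = set(fold)
        train = [i for i in range(len(xs)) if i not in held_out]
        w = fit_ridge([xs[i] for i in train], [ys[i] for i in train], lam)
        scores.append(mse(w, [xs[i] for i in fold], [ys[i] for i in fold]))
    return mean(scores)


def grid_search(xs, ys, grid, k=3):
    """Return the hyperparameter value with the lowest cross-validated error."""
    return min(grid, key=lambda lam: cross_validate(xs, ys, lam, k))


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.0 * x for x in xs]  # noise-free toy data, so no regularization should win
best_lam = grid_search(xs, ys, [0.0, 1.0, 10.0])
```

Each candidate value is scored only on rows the model did not see during fitting, which is what makes the comparison between candidates fair.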
When you’re finished with this course, you’ll have the skills and knowledge needed to create ML models with Spark MLlib and use them for predictive analytics.
Table of contents
- Version Check 0m
- Prerequisites and Course Outline 2m
- Machine Learning on Apache Spark 5m
- Demo: Configuring the Workspace and Setting up a Notebook 3m
- Demo: Exploring the Diabetes Dataset 4m
- Demo: Standardization and Scaling 5m
- Demo: Normalization 3m
- Demo: Converting Continuous Values to Categorical Values 2m
- Demo: Tokenizing Text Data 3m
- Demo: Label Encoding and One-hot Encoding 5m
- Demo: Feature Selection 6m
- Quick Overview of Linear Regression 5m
- Lasso, Ridge, and Elastic Net Regression 4m
- Demo: Exploring the Life Expectancy Dataset 4m
- Demo: Building and Evaluating a Linear Regression Model 6m
- Demo: Hyperparameter Tuning 4m
- Quick Overview of Ensemble Learning 3m
- Averaging and Boosting 2m
- Machine Learning Pipelines 3m
- Demo: Exploring the CO2 Emissions Dataset 4m
- Demo: Random Forest Regression 5m
- Demo: Gradient Boosted Tree Regression 5m
- Quick Overview of Logistic Regression 6m
- Demo: Exploring the Loan Dataset 3m
- Demo: Logistic Regression 4m
- Demo: Performing Predictions on Streaming Data 5m
- Quick Overview of Decision Trees 3m
- Demo: Exploring the Bank Marketing Campaign Dataset 3m
- Demo: Decision Tree Classifier 7m
- Demo: Hyperparameter Tuning with Cross Validation 3m
- Summary and Further Study 2m