Getting Started with Spark 2
The 2.x releases of Spark introduce significantly different and upgraded features. This course covers these changes in both theory and practice.
What you'll learn
Spark is possibly the most popular engine for big data processing today, and the 2.x release adds several new features that make Spark more powerful and easier to work with. In this course, Getting Started with Spark 2, you will get up and running with Spark 2 and understand the similarities and differences between version 2.x and older versions.
First, you will see the basic Spark architecture and the details of Project Tungsten, which brought great performance improvements to Spark 2. You will go over the new developer APIs using DataFrames and see how they interoperate with RDDs from Spark 1.x. Next, you will move on to big data processing, where you will load and clean datasets, remove invalid rows, execute transformations to extract insights, and perform grouping, sorting, and aggregations using the new DataFrame APIs. You will also study how and where to use broadcast variables and accumulators. Finally, you will work with Spark SQL, which allows you to use SQL commands for big data processing. The course also covers advanced SQL support in the form of windowing operations.
By the end of this course, you should be very comfortable working with Spark DataFrames and Spark SQL. You will be better equipped to make technical choices based on the performance trade-offs of older versions of Spark vs. Spark 2.
Software required: Apache Spark 2.2, Python 2.7.
Table of contents
- Version Check 0m
- Module Overview 2m
- Prerequisite and Course Outline 2m
- Introducing Spark 4m
- RDDs: Basic Building Blocks of Spark 8m
- RDDs, Datasets, DataFrames: What's the Difference? 6m
- Demo: Installing Spark 2 4m
- Architecture Overview: Spark 1 and 2 6m
- Demo: Working with RDDs in Spark 2 4m
- Demo: Converting RDDs to DataFrames 3m
- Demo: Working with Complex Data Types in DataFrames 1m
- Demo: Introducing the SQL Context 3m
- Demo: Accessing RDDs in DataFrames 3m
- Demo: Spark DataFrames and Pandas DataFrames 1m
- Understanding the Differences Between Spark 2 and Spark 1 4m
- Project Tungsten 6m
- Module Overview 1m
- Introducing the Spark Session 1m
- Demo: Exploring the London Crime Dataset 5m
- Demo: Grouping, Aggregating, and Ordering Data 5m
- Demo: Aggregations and Visualizations 4m
- Broadcast Variables and Accumulators 9m
- Demo: UDFs to Extract Information About Soccer Players 5m
- Demo: Working with Joins in DataFrames 5m
- Demo: Using Broadcast Variables 2m
- Demo: Working with Accumulators 5m
- Demo: Saving DataFrames as CSV and JSON Files 3m
- Demo: Using Custom Accumulators 2m
- Demo: Other Join Operations 3m