Apache Spark 3 Fundamentals
Learn the fundamentals of Apache Spark 3: process batch and streaming data, set up the environment, use RDDs and DataFrames, optimize applications, and build pipelines with Databricks and Azure Synapse Analytics.
What you'll learn
Apache Spark is one of the most widely used analytics engines. It performs distributed data processing and can handle petabytes of data. Spark can work with a variety of data formats, process data at high speeds, and support multiple use cases. Version 3 of Spark brings a whole new set of features and optimizations.

In this course, Apache Spark 3 Fundamentals, you'll learn how Apache Spark can be used to process large volumes of data, whether batch or streaming data, and about the growing ecosystem of Spark. First, you'll learn what Apache Spark is, its architecture, and its execution model. You'll then see how to set up the Spark environment. Next, you'll learn about two Spark APIs – RDDs and DataFrames – and see how to use them to extract, analyze, clean, and transform batch data. Then, you'll learn various techniques to optimize your Spark applications, as well as the new optimization features of Apache Spark 3. After that, you'll see how to reliably store data in a Data Lake using the Delta Lake format and build streaming pipelines with Spark.

Finally, you'll see how to use Spark in cloud services like Databricks and Azure Synapse Analytics. By the end of this course, you'll have the knowledge and skills to work with Apache Spark and use its capabilities and ecosystem to build large-scale data processing pipelines. So, let's get started!
Table of contents
- Module Overview 1m
- Understanding Spark Environments 6m
- Installing Spark 8m
- Monitoring Spark with Web UI 2m
- Option 1: Running Spark in Command Line 4m
- Option 2: Running Spark with Jupyter Notebooks 5m
- Option 3: Creating Project with PyCharm IDE 3m
- Option 4: Running Jobs with Spark Submit 3m
- Setting Up Multi-Node Cluster 5m
- Summary 2m
- Module Overview 1m
- Working with Spark Partitions 8m
- Changing DataFrame Partitions 6m
- Memory Management 6m
- Persisting Data 6m
- Spark Join Strategies and Broadcast Joins 6m
- Optimizing Shuffle Sort Join with Bucketing 5m
- Dynamic Resource Allocation 7m
- Resource Allocation Using Fair Scheduling 3m
- Summary 2m