Getting Started with Apache Spark on Databricks
This course will introduce you to analytical queries and big data processing using Apache Spark on Azure Databricks. You will learn how to work with Spark transformations, actions, visualizations, and functions using the Databricks Runtime.
What you'll learn
Azure Databricks lets you process and query big data using the Apache Spark unified analytics engine. With Azure Databricks you can set up your Apache Spark environment in minutes, autoscale your processing, and collaborate and share projects in an interactive workspace.
In this course, Getting Started with Apache Spark on Databricks, you will learn the components of the Apache Spark analytics engine, which allows you to process batch as well as streaming data using a unified API. First, you will learn how the Spark architecture is configured for big data processing and how the Databricks Runtime on Azure makes it easy to work with Apache Spark on the Azure cloud platform. You will also explore the basic concepts and terminology for the technologies used in Azure Databricks.
Next, you will learn the workings and nuances of Resilient Distributed Datasets (RDDs), the core data structure used for big data processing in Apache Spark. You will see that RDDs are the data structures on top of which Spark DataFrames are built. You will study the two types of operations that can be performed on DataFrames, namely transformations and actions, and understand the difference between them. You will also learn how Databricks allows you to explore and visualize your data using the display() function, which leverages native Python libraries for visualizations.
Finally, you will get hands-on experience with big data processing operations such as projection, filtering, and aggregation. Along the way, you will learn how to read data from an external source such as Azure cloud storage and how to use built-in functions in Apache Spark to transform your data.
When you are finished with this course, you will have the skills and ability to work with basic transformations, visualizations, and aggregations using Apache Spark on Azure Databricks.
Table of contents
- Version Check 0m
- Prerequisites and Course Outline 2m
- Introducing Apache Spark 5m
- Spark Architecture 5m
- Introducing Databricks 3m
- Databricks Science and Engineering Concepts 7m
- Azure Databricks Architectural Overview 4m
- Demo: Creating an Azure Databricks Workspace 3m
- Demo: Provisioning an All-Purpose Cluster 5m
- RDDs and DataFrames 7m
- Spark APIs 2m
- Demo: dbutils 3m
- Demo: Transformations and Actions on RDDs 5m
- Demo: Transformations and Actions on DataFrames 3m
- Demo: Uploading a Dataset to DBFS Using Notebooks 4m
- Demo: Basic Selection and Filtering Operations 4m
- Demo: Writing CSV Files out to DBFS 4m
- Demo: Creating a Table Using the Databricks UI 2m
- Demo: Visualizing Data Using the Display Command 3m
- Demo: Exploring Databricks Visualizations 5m
- Demo: Reading and Parsing JSON Data 6m
- Demo: Accessing Nested Fields and List Elements 5m
- Demo: Setting up an Azure Storage Account 3m
- Demo: Storing Secrets in the Azure Key Vault 2m
- Demo: Reading from Azure Data Storage 6m
- Demo: Basic SQL Transformations 5m
- Demo: Built-in Functions 6m
- Summary and Next Steps 1m