Architecting Big Data Solutions Using Google Dataproc
Dataproc is Google’s managed Hadoop offering in the cloud. This course teaches you how the separation of storage and compute lets you use clusters more efficiently, purely for processing data rather than storing it.
What you'll learn
When organizations plan their move to the Google Cloud Platform, Dataproc offers the familiar Hadoop feature set along with additional powerful paradigms such as the separation of compute and storage. Dataproc allows you to lift and shift your Hadoop processing jobs to the cloud and store your data separately in Cloud Storage buckets, effectively eliminating the need to keep your clusters running at all times.

In this course, Architecting Big Data Solutions Using Google Dataproc, you’ll learn to work with managed Hadoop on the Google Cloud and the best practices to follow when migrating your on-premises jobs to Dataproc clusters. First, you'll create a Dataproc cluster and configure firewall rules so that you can access the cluster manager UI from your local machine. Next, you'll discover how to use the Spark distributed analytics engine on your Dataproc cluster. Then, you'll explore how to write code that integrates your Spark jobs with BigQuery and Cloud Storage buckets using connectors. Finally, you'll learn how to use your Dataproc cluster to perform extract, transform, and load (ETL) operations, using Pig as a scripting language and working with Hive tables.

By the end of this course, you'll have the knowledge you need to work with Google’s managed Hadoop offering and a sound idea of how to migrate the jobs and data on your on-premises Hadoop cluster to the Google Cloud.
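The cluster-and-firewall workflow described above can be sketched with the gcloud CLI. This is a minimal sketch, not the course's exact commands: the cluster name, region, bucket, worker counts, and source IP are placeholder assumptions, and the web UI ports vary by Hadoop version.

```shell
# Create a Dataproc cluster with two standard workers and two
# preemptible workers (names and region are placeholders).
gcloud dataproc clusters create demo-cluster \
    --region=us-central1 \
    --num-workers=2 \
    --num-preemptible-workers=2

# Open the YARN Resource Manager UI (port 8088) and the HDFS
# NameNode UI (9870 on Hadoop 3.x; 50070 on Hadoop 2.x) to a
# single source IP only -- replace the example address with yours.
gcloud compute firewall-rules create allow-dataproc-ui \
    --allow=tcp:8088,tcp:9870 \
    --source-ranges=203.0.113.5/32

# Stage input data and job code in a Cloud Storage bucket so the
# cluster can be deleted at any time without losing anything.
gsutil cp input.txt wordcount.jar gs://demo-dataproc-bucket/
```

Because the data and code live in Cloud Storage rather than HDFS, the cluster itself becomes disposable, which is the storage/compute separation this course builds on.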
Table of contents
- Module Overview 2m
- Prerequisites, Course Outline, and Spikey Sales Scenarios 4m
- Distributed Processing 3m
- Storage in Traditional Hadoop 3m
- Compute in Traditional Hadoop 4m
- Separating Storage and Compute with Dataproc 6m
- Hadoop vs. Dataproc 4m
- Using the Cloud Shell, Enabling the Dataproc API 4m
- Dataproc Features 4m
- Migrating to Dataproc 6m
- Dataproc Pricing 3m
- Module Overview 1m
- Creating a Dataproc Cluster Using the Web Console 7m
- Using SSH to Connect to the Master Node 4m
- Creating a Firewall Rule to Enable Access to Dataproc 5m
- Accessing the Resource Manager and Name Node UI 2m
- Uploading Data and MapReduce Code to Cloud Storage 4m
- Running MapReduce on Dataproc 4m
- Running MapReduce Using the gcloud Command Line Utility 4m
- Creating a Cluster with Preemptible Instances Using gcloud 3m
- Monitoring Clusters Using Stackdriver 5m
- Stackdriver Monitoring Groups and Alerting Policies 5m
- Configuring Initialization Actions for Dataproc 4m
- Module Overview 1m
- Spark for Distributed Processing 4m
- Running a Spark Scala Job Using the Web Console 3m
- Executing a Spark Application Using gcloud 3m
- Creating a BigQuery Table 4m
- PySpark Application Using BigQuery and Cloud Storage Connectors 4m
- Executing a Spark Application to Get Results in BigQuery 3m
- Monitoring Spark Jobs on Dataproc 3m
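The job types covered in the modules above (Spark, PySpark with the BigQuery connector, Pig, and Hive) all follow the same gcloud submission pattern. A minimal sketch, in which the cluster, bucket, class, script, and table names are placeholder assumptions:

```shell
# Submit a Spark (Scala) job from a jar staged on Cloud Storage;
# arguments after "--" are passed to the application itself.
gcloud dataproc jobs submit spark \
    --cluster=demo-cluster --region=us-central1 \
    --class=com.example.WordCount \
    --jars=gs://demo-dataproc-bucket/wordcount.jar \
    -- gs://demo-dataproc-bucket/input.txt

# Submit a PySpark job, pulling in the BigQuery connector via
# --jars (the connector path/version shown is an assumption).
gcloud dataproc jobs submit pyspark \
    gs://demo-dataproc-bucket/bq_job.py \
    --cluster=demo-cluster --region=us-central1 \
    --jars=gs://spark-lib/bigquery/spark-bigquery-latest_2.12.jar

# Run a Pig ETL script and an inline Hive query on the same cluster.
gcloud dataproc jobs submit pig \
    --cluster=demo-cluster --region=us-central1 \
    --file=gs://demo-dataproc-bucket/etl.pig
gcloud dataproc jobs submit hive \
    --cluster=demo-cluster --region=us-central1 \
    -e "SELECT COUNT(*) FROM sales"
```

Every submission lands in the same Dataproc Jobs list, which is also where the monitoring covered in the last module picks up.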