Developing Spark Applications Using Scala & Cloudera
Apache Spark is one of the fastest and most efficient general-purpose engines for large-scale data processing. In this course, you'll learn how to develop Spark applications for your Big Data using Scala and a stable Hadoop distribution, Cloudera CDH.
What you'll learn
At the core of working with large-scale datasets is a thorough knowledge of Big Data platforms like Apache Spark and Hadoop. In this course, Developing Spark Applications Using Scala & Cloudera, you'll learn how to process data at scales you previously thought were out of your reach. First, you'll learn the technical details of how Spark works. Next, you'll explore the RDD API, the original core abstraction of Spark. Then, you'll discover how to become more proficient using Spark SQL and DataFrames. Finally, you'll learn to work with Spark's typed API: Datasets. When you're finished with this course, you'll have a foundational knowledge of Apache Spark with Scala and Cloudera that will help you as you move forward to develop large-scale data applications and work with Big Data efficiently.
Table of contents
- Getting an Environment & Data: CDH + StackOverflow 2m
- Prerequisites & Known Issues 2m
- Upgrading Cloudera Manager and CDH 6m
- Installing or Upgrading to Java 8 (JDK 1.8) 4m
- Getting Spark - There Are Several Options: 1.6 3m
- Getting Spark 2 Standalone 3m
- Installing Spark 2 on Cloudera 6m
- Data: StackOverflow & StackExchange Dumps + Demo Files 3m
- Preparing Your Big Data 4m
- Takeaway 2m
- Refreshing Your Knowledge: Scala Fundamentals for This Course 1m
- Scala's History and Overview 2m
- Building and Running Scala Applications 1m
- Creating Self-contained Applications, Including scalac & sbt 5m
- The Scala Shell: REPL (Read Evaluate Print Loop) 1m
- Scala, the Language 4m
- More on Types, Functions, and Operations 2m
- Expressions, Functions, and Methods 1m
- Classes, Case Classes, and Traits 1m
- Flow Control 1m
- Functional Programming 1m
- Enter spark2-shell: Spark in the Scala Shell 1m
- Takeaway 2m
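The Scala refresher above can be condensed into a short, hedged sketch. The names (`Post`, `allTags`, `describe`) are invented for illustration, not taken from the course:

```scala
// A minimal, self-contained sketch of the Scala fundamentals this module
// refreshes: case classes, immutable values, higher-order functions,
// and pattern matching.
object ScalaBasics {
  // Case classes give immutable data with equality and pattern matching for free
  case class Post(id: Int, tags: List[String])

  // A higher-order pipeline, the same shape as Spark's flatMap on RDDs
  def allTags(posts: List[Post]): List[String] =
    posts.flatMap(p => p.tags)

  // Pattern matching deconstructs a case class by shape
  def describe(p: Post): String = p match {
    case Post(id, Nil)  => s"post $id has no tags"
    case Post(id, tags) => s"post $id has ${tags.size} tags"
  }

  def main(args: Array[String]): Unit = {
    val posts = List(Post(1, List("scala", "spark")), Post(2, List("sql")))
    println(allTags(posts).mkString(", "))
    posts.foreach(p => println(describe(p)))
  }
}
```

These same building blocks (lambdas passed to `flatMap`/`map`, case classes as typed records) reappear throughout the Spark API, which is why the course front-loads them.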
- Understanding Spark: An Overview 3m
- Spark, Word Count, Operations, and Transformations 2m
- A Few Words on Fine-Grained Transformations and Scalability 2m
- Word Count in "Not Big Data" 2m
- How Word Count Works, Featuring Coarse-Grained Transformations 4m
- Parallelism by Partitioning Data 3m
- Pipelining: One of the Secrets of Spark's Performance 2m
- Narrow and Wide Transformations 4m
- Lazy Execution, Lineage, Directed Acyclic Graph (DAG), and Fault Tolerance 4m
- Time for the Big Picture: Spark Libraries 2m
- Takeaway 1m
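The word count idea this module walks through can be sketched as it might look in `spark2-shell` (where `sc`, the SparkContext, is already provided; the HDFS path is a placeholder, not from the course):

```scala
// Word count with the RDD API. Every step below is a transformation and is
// lazy: Spark only records the lineage (DAG) until an action runs.
val counts = sc.textFile("hdfs:///path/to/input.txt")   // placeholder path
  .flatMap(line => line.split("\\s+"))  // narrow transformation: no shuffle
  .map(word => (word, 1))               // coarse-grained: applied to every element
  .reduceByKey(_ + _)                   // wide transformation: shuffles by key

// An action triggers the whole DAG; the narrow steps are pipelined together
counts.take(5).foreach(println)
```

The narrow steps (`flatMap`, `map`) run back-to-back on each partition without moving data, which is the pipelining the module describes; only `reduceByKey` forces a shuffle.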
- Getting Technical: Spark Architecture 3m
- Storage in Spark and Supported Data Formats 3m
- Let's Talk APIs: Low Level and High Level Spark APIs 5m
- Performance Optimizations: Tungsten and Catalyst 3m
- SparkContext and SparkSession: Entry Points to Spark Apps 4m
- Spark Configuration + Client and Cluster Deployment Modes 6m
- Spark on YARN: The Cluster Manager 3m
- Spark with Cloudera Manager and YARN UI 4m
- Visualizing Your Spark App: Web UI and History Server 8m
- Logging in with Spark and Cloudera 2m
- Navigating the Spark and Cloudera Documentation 4m
- Takeaway 1m
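The two entry points this module covers can be sketched as follows (the app name and config values are invented examples, not the course's settings):

```scala
import org.apache.spark.sql.SparkSession

// SparkSession is the unified entry point since Spark 2.x;
// it wraps the older SparkContext.
val spark = SparkSession.builder()
  .appName("StackOverflowAnalysis")        // placeholder app name
  .master("yarn")                          // the cluster manager on CDH; "local[*]" on a laptop
  .config("spark.executor.memory", "2g")   // example setting; tune for your cluster
  .getOrCreate()

// The low-level entry point is still reachable, and the RDD API uses it
val sc = spark.sparkContext
```

In `spark2-shell` on Cloudera you don't build this yourself: the shell hands you `spark` and `sc` preconfigured, which is why the deployment-mode and configuration lessons matter more for self-contained applications.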
- Learning the Core of Spark: RDDs 2m
- SparkContext: The Entry Point to a Spark Application 4m
- RDD and PairRDD - Resilient Distributed Datasets 4m
- Creating RDDs with Parallelize 4m
- Returning Data to the Driver, i.e. collect(), take(), first()... 4m
- Partitions, Repartition, Coalesce, Saving as Text, and HUE 3m
- Creating RDDs from External Datasets 10m
- Saving Data as ObjectFile, NewAPIHadoopFile, SequenceFile, ... 6m
- Creating RDDs with Transformations 3m
- A Little Bit More on Lineage and Dependencies 1m
- Takeaway 2m
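As a hedged sketch of the RDD basics above, run in `spark2-shell` where `sc` already exists (the output path is a placeholder):

```scala
// parallelize turns a local collection into a distributed RDD
val rdd = sc.parallelize(1 to 100, numSlices = 4)
rdd.getNumPartitions                  // 4

// Transformations are lazy...
val doubled = rdd.map(_ * 2)

// ...and actions return data to the driver
doubled.take(3)                       // Array(2, 4, 6)
doubled.first()                       // 2

// Reduce the partition count without a shuffle, then save as text
doubled.coalesce(2).saveAsTextFile("hdfs:///path/to/output")  // placeholder path
```

`collect()` would bring the entire RDD to the driver, so for large datasets the course's advice to prefer `take(n)` while exploring is worth keeping in mind.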
- Going Deeper into Spark Core 1m
- Functional Programming: Anonymous Functions (Lambda) in Spark 2m
- A Quick Look at Map, FlatMap, Filter, and Sort 5m
- How Can I Tell It Is a Transformation? 1m
- Why Do We Need Actions? 1m
- Partition Operations: MapPartitions and PartitionBy 6m
- Sampling Your Data 2m
- Set Operations: Join, Union, Full Right, Left Outer, and Cartesian 5m
- Combining, Aggregating, Reducing, and Grouping on PairRDDs 9m
- ReduceByKey vs. GroupByKey: Which One Is Better? 1m
- Grouping Data into Buckets with Histogram 3m
- Caching and Data Persistence 2m
- Shared Variables: Accumulators and Broadcast 5m
- What's Needed for Developing Self-contained Spark Applications 2m
- Disadvantages of RDDs - So What's Better? 1m
- Takeaway 2m
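The PairRDD aggregation trade-off highlighted above (`reduceByKey` vs. `groupByKey`) and the shared variables can be sketched like this in `spark2-shell` (the data and names are invented):

```scala
val pairs = sc.parallelize(Seq(("scala", 1), ("sql", 2), ("scala", 3)))

// reduceByKey combines values map-side before the shuffle: usually the better choice
val sums = pairs.reduceByKey(_ + _)

// groupByKey ships every value across the network first, then aggregates:
// same result, more shuffle traffic
val sums2 = pairs.groupByKey().mapValues(_.sum)

// Shared variables:
// a broadcast is a read-only value cached once per executor...
val stopWords = sc.broadcast(Set("the", "a"))
// ...and an accumulator aggregates counters back to the driver
val seen = sc.longAccumulator("records seen")
```

Both `sums` and `sums2` produce the same pairs, which is exactly why the module frames the choice as a performance question rather than a correctness one.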
- Increasing Proficiency with Spark: DataFrames & Spark SQL 1m
- "Everyone" Uses SQL and How It All Began 3m
- Hello DataFrames and Spark SQL 3m
- SparkSession: The Entry Point to the Spark SQL / DataFrame API 2m
- Creating DataFrames 2m
- DataFrames to RDDs and Vice Versa 3m
- Loading DataFrames: Text and CSV 2m
- Schemas: Inferred and Programmatically Specified + Option 5m
- More Data Loading: Parquet and JSON 4m
- Rows, Columns, Expressions, and Operators 2m
- Working with Columns 2m
- More Columns, Expressions, Cloning, Renaming, Casting, & Dropping 4m
- User Defined Functions (UDFs) on Spark SQL 3m
- Takeaway 2m
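A hedged sketch of the DataFrame operations this module covers (`spark` is the SparkSession; the CSV path, column names, and UDF are invented for illustration):

```scala
import org.apache.spark.sql.functions._

// Load a CSV, letting Spark infer the schema
// (or specify one programmatically with a StructType)
val posts = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .csv("hdfs:///path/to/posts.csv")      // placeholder path

// Column expressions: casting, renaming, dropping
val clean = posts
  .withColumn("score", col("score").cast("int"))
  .withColumnRenamed("creation_date", "created")
  .drop("extra_column")

// A user-defined function applied to a column
val shout = udf((s: String) => if (s == null) null else s.toUpperCase)
clean.select(shout(col("title")).as("title_upper")).show(5)

// DataFrame -> RDD when you need the low-level API back
val asRdd = clean.rdd
```

UDFs are opaque to the Catalyst optimizer, so the built-in `functions._` should be preferred when one exists; the course's optimization lessons (Tungsten and Catalyst) explain why.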
- Querying, Sorting, and Filtering DataFrames: The DSL 5m
- What to Do with Missing or Corrupt Data 4m
- Saving DataFrames 5m
- Spark SQL: Querying Using Temporary Views 4m
- Loading Files and Views into DataFrames Using Spark SQL 2m
- Saving to Persistent Tables + Spark 2 Known Issue 2m
- Hive Support and External Databases 5m
- Aggregating, Grouping, and Joining 5m
- The Catalog API 1m
- Takeaway 2m
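The Spark SQL workflow above can be sketched as follows, assuming a DataFrame named `posts` has already been loaded (view, table, and column names are invented):

```scala
// Register a DataFrame as a temporary view, then query it with SQL
posts.createOrReplaceTempView("posts")

spark.sql("""
  SELECT owner_id, COUNT(*) AS n, AVG(score) AS avg_score
  FROM posts
  WHERE score IS NOT NULL
  GROUP BY owner_id
  ORDER BY n DESC
""").show(10)

// Save to a persistent table, then inspect it through the Catalog API
posts.write.mode("overwrite").saveAsTable("posts_archive")
spark.catalog.listTables().show()
```

Temporary views live only for the session, while `saveAsTable` persists through the Hive metastore on CDH, which is the distinction the persistent-tables and Hive-support lessons build on.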