Developing Spark Applications with Python & Cloudera
Apache Spark is one of the fastest and most efficient general-purpose engines for large-scale data processing. In this course, you will learn how to develop Spark applications that process Big Data using Python and Cloudera CDH, a stable Hadoop distribution.
What you'll learn
At the core of working with large-scale datasets is a thorough knowledge of Big Data platforms like Apache Spark and Hadoop. In this course, Developing Spark Applications with Python & Cloudera, you’ll learn how to process data at scales you previously thought were out of your reach. First, you’ll learn the technical details of how Spark works. Next, you’ll explore the RDD API, the original core abstraction of Spark. Finally, you’ll discover how to become more proficient using Spark SQL and DataFrames. When you’re finished with this course, you’ll have a foundational knowledge of Apache Spark with Python and Cloudera that will help you develop large-scale data applications that work with Big Data efficiently.
Table of contents
- Getting an Environment and Data: CDH + StackOverflow 2m
- Prerequisites and Known Issues 2m
- Upgrading Cloudera Manager and CDH 6m
- Installing or Upgrading to Java 8 (JDK 1.8) 4m
- Getting Spark - There Are Several Options: 1.6 3m
- Getting Spark 2 Standalone 3m
- Installing Spark 2 on Cloudera 6m
- Bonus -> IPython with Anaconda: Supercharge Your PySpark Shell 7m
- Data: StackOverflow and StackExchange Dumps + Demo Files 3m
- Preparing Your Big Data 4m
- Takeaway 2m
- Refreshing Your Knowledge: Python Fundamentals for This Course 1m
- Python's History, Philosophy, and Paradigm 3m
- The Python Shell: REPL 3m
- Syntax, Variables, (Dynamic) Types, and Operators 7m
- Compound Variables: Lists, Tuples, and Dictionaries 5m
- Code Blocks, Functions, Loops, Generators, and Flow Control 5m
- Map, Filter, Group, and Reduce 2m
- Enter PySpark: Spark in the Shell 2m
- Takeaway 2m
- Understanding Spark: An Overview 3m
- Spark, Word Count, Operations, and Transformations 2m
- A Few Words on Fine Grained Transformations and Scalability 2m
- Word Count in "Not Big Data" 2m
- How Word Count Works, Featuring Coarse Grained Transformations 4m
- Parallelism by Partitioning Data 3m
- Pipelining: One of the Secrets of Spark's Performance 2m
- Narrow and Wide Transformations 4m
- Lazy Execution, Lineage, Directed Acyclic Graph (DAG), and Fault Tolerance 4m
- The Spark Libraries and Spark Packages 2m
- Takeaway 1m
- Getting Technical: Spark Architecture 3m
- Storage in Spark and Supported Data Formats 3m
- Let's Talk APIs: Low-level and High-level Spark APIs 4m
- Performance Optimizations: Tungsten and Catalyst 3m
- SparkContext and SparkSession: Entry Points to Spark Apps 4m
- Spark Configuration + Client and Cluster Deployment Modes 6m
- Spark on YARN: The Cluster Manager 3m
- Spark with Cloudera Manager and YARN UI 4m
- Visualizing Your Spark App: Web UI and History Server 8m
- Logging in Spark and with Cloudera 2m
- Navigating the Spark and Cloudera Documentation 4m
- Takeaway 1m
- Learning the Core of Spark: RDDs 2m
- SparkContext: The Entry Point to a Spark Application 3m
- RDD and PairRDD - Resilient Distributed Datasets 4m
- Creating RDDs with Parallelize 4m
- Returning Data to the Driver, i.e. collect(), take(), first()... 4m
- Partitions, Repartition, Coalesce, Saving as Text, and HUE 3m
- Creating RDDs from External Datasets 10m
- Saving Data as PickleFile, NewAPIHadoopFile, SequenceFile, ... 5m
- Creating RDDs with Transformations 3m
- A Little Bit More on Lineage and Dependencies 1m
- Takeaway 2m
- Going Deeper into Spark Core 1m
- Functional Programming: Anonymous Functions (Lambda) in Spark 1m
- A Quick Look at Map, FlatMap, Filter, and Sort 5m
- How I Can Tell It Is a Transformation 1m
- Why Do We Need Actions? 1m
- Partition Operations: MapPartitions and PartitionBy 7m
- Sampling Your Data 2m
- Set Operations: Join, Union, Full Right, Left Outer, and Cartesian 5m
- Combining, Aggregating, Reducing, and Grouping on PairRDDs 8m
- ReduceByKey vs. GroupByKey: Which One Is Better? 1m
- Grouping Data into Buckets with Histogram 3m
- Caching and Data Persistence 2m
- Shared Variables: Accumulators and Broadcast Variables 5m
- Developing Self-contained PySpark Applications, Packages, and Files 1m
- Disadvantages of RDDs - So What's Better? 1m
- Takeaway 2m
- Increasing Proficiency with Spark: DataFrames & Spark SQL 1m
- "Everyone" Uses SQL and How It All Began 3m
- Hello DataFrames and Spark SQL 3m
- SparkSession: The Entry Point to the Spark SQL and DataFrame API 2m
- Creating DataFrames 3m
- DataFrames to RDDs and Vice Versa 3m
- Loading DataFrames: Text and CSV 2m
- Schemas: Inferred and Programmatically Specified + Option 5m
- More Data Loading: Parquet and JSON 4m
- Rows, Columns, Expressions, and Operators 2m
- Working with Columns 2m
- More Columns, Expressions, Cloning, Renaming, Casting, & Dropping 4m
- User Defined Functions (UDFs) on Spark SQL 3m
- Takeaway 2m
- Querying, Sorting, and Filtering DataFrames: The DSL 5m
- What to Do with Missing or Corrupt Data 4m
- Saving DataFrames 6m
- Spark SQL: Querying Using Temporary Views 4m
- Loading Files and Views into DataFrames Using Spark SQL 2m
- Saving to Persistent Tables + Spark 2 Known Issue 2m
- Hive Support and External Databases 5m
- Aggregating, Grouping, and Joining 5m
- The Catalog API 1m
- Takeaway 2m