Exploring the Apache Beam SDK for Modeling Streaming Data for Processing
Apache Beam is an open-source, unified model for processing batch and streaming data in parallel. Although it originated as the programming model behind Google’s Cloud Dataflow service, Beam now allows pipelines to be executed on any of its supported distributed processing backends.
What you'll learn
Apache Beam SDKs can represent and process both finite and infinite datasets using the same programming model. All data processing tasks are defined using a Beam pipeline and are represented as directed acyclic graphs. These pipelines can then be executed on multiple execution backends such as Google Cloud Dataflow, Apache Flink, and Apache Spark.
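For orientation, here is a minimal sketch of such a pipeline written with the Beam Java SDK; the file paths and the uppercasing transform are illustrative assumptions rather than examples taken from the course.

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.TypeDescriptors;

public class MinimalPipeline {
    public static void main(String[] args) {
        PipelineOptions options = PipelineOptionsFactory.fromArgs(args).create();
        Pipeline pipeline = Pipeline.create(options);

        // Each apply() adds a node to the pipeline's directed acyclic graph.
        PCollection<String> lines = pipeline.apply("ReadLines",
                TextIO.read().from("input.txt"));            // assumed input path
        PCollection<String> upper = lines.apply("ToUpper",
                MapElements.into(TypeDescriptors.strings())
                        .via((String s) -> s.toUpperCase()));
        upper.apply("WriteLines", TextIO.write().to("output")); // assumed output prefix

        // The chosen runner (the DirectRunner by default) executes the graph.
        pipeline.run().waitUntilFinish();
    }
}
```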
In this course, Exploring the Apache Beam SDK for Modeling Streaming Data for Processing, you will explore the Beam APIs for defining pipelines, executing transforms, and performing windowing and join operations.
First, you will understand and work with the basic components of a Beam pipeline: PCollections and PTransforms. You will work with PCollections holding different kinds of elements and see how to specify the schema for PCollection elements. You will then configure these pipelines using custom options and execute them on backends such as Apache Flink and Apache Spark.
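As a rough sketch of what custom options look like, the snippet below declares a small PipelineOptions interface; the option name and its default value are assumptions made for this illustration, and the runner would be selected with a command-line flag such as --runner=FlinkRunner or --runner=SparkRunner.

```java
import org.apache.beam.sdk.options.Default;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class CustomOptionsExample {

    // Custom options are declared as an interface extending PipelineOptions.
    public interface MyOptions extends PipelineOptions {
        @Description("Path of the input file")   // option name is an assumption
        @Default.String("input.csv")
        String getInputPath();
        void setInputPath(String value);
    }

    public static void main(String[] args) {
        // Parse options from the command line; the execution backend is picked
        // with a flag such as --runner=FlinkRunner or --runner=SparkRunner.
        MyOptions options = PipelineOptionsFactory.fromArgs(args)
                .withValidation()
                .as(MyOptions.class);
        System.out.println("Reading from " + options.getInputPath());
    }
}
```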
Next, you will explore the different kinds of core transforms that you can apply to streaming data. These include ParDo with DoFns, GroupByKey and CoGroupByKey for grouping and join operations, and the Flatten and Partition transforms.
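The sketch below shows the ParDo/DoFn pattern applied to a simple filtering use case; the predicate (dropping blank lines) is an assumption chosen purely for illustration.

```java
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.values.PCollection;

public class ParDoFilterExample {

    // A DoFn that keeps only non-blank lines; elements that are not output
    // are simply dropped from the resulting PCollection.
    static class FilterEmptyLinesFn extends DoFn<String, String> {
        @ProcessElement
        public void processElement(@Element String line, OutputReceiver<String> out) {
            if (!line.trim().isEmpty()) {
                out.output(line);
            }
        }
    }

    // ParDo.of wraps the DoFn so it can be applied to a PCollection.
    static PCollection<String> filterEmpty(PCollection<String> lines) {
        return lines.apply("FilterEmpty", ParDo.of(new FilterEmptyLinesFn()));
    }
}
```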
You will then see how you can perform windowing operations on input streams and apply fixed windows, sliding windows, session windows, and global windows to your streaming data. You will use the join extension library to perform inner and outer joins on datasets.
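As a hedged sketch, the snippet below shows how fixed and sliding windows might be applied with Window.into; the window sizes and the keyed element type are illustrative assumptions.

```java
import org.apache.beam.sdk.transforms.windowing.FixedWindows;
import org.apache.beam.sdk.transforms.windowing.SlidingWindows;
import org.apache.beam.sdk.transforms.windowing.Window;
import org.apache.beam.sdk.values.KV;
import org.apache.beam.sdk.values.PCollection;
import org.joda.time.Duration;

public class WindowingExamples {

    // Assign elements to non-overlapping one-minute windows (size is assumed).
    static PCollection<KV<String, Long>> fixedWindows(PCollection<KV<String, Long>> events) {
        return events.apply("FixedWindows",
                Window.<KV<String, Long>>into(FixedWindows.of(Duration.standardMinutes(1))));
    }

    // Assign elements to one-minute windows that start every ten seconds,
    // so each element can belong to several overlapping windows.
    static PCollection<KV<String, Long>> slidingWindows(PCollection<KV<String, Long>> events) {
        return events.apply("SlidingWindows",
                Window.<KV<String, Long>>into(SlidingWindows.of(Duration.standardMinutes(1))
                        .every(Duration.standardSeconds(10))));
    }
}
```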
Finally, you will configure the metrics you want tracked during pipeline execution, including counter, distribution, and gauge metrics, and round off the course by executing SQL queries on input data.
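A minimal sketch of declaring counter and distribution metrics inside a DoFn follows; the metric names and the namespace class are assumptions made for this example.

```java
import org.apache.beam.sdk.metrics.Counter;
import org.apache.beam.sdk.metrics.Distribution;
import org.apache.beam.sdk.metrics.Metrics;
import org.apache.beam.sdk.transforms.DoFn;

public class MeteredFn extends DoFn<String, String> {

    // Counter: a monotonically increasing count of processed elements.
    private final Counter elementsSeen = Metrics.counter(MeteredFn.class, "elements_seen");

    // Distribution: tracks min, max, sum, and count of the reported values.
    private final Distribution lineLengths = Metrics.distribution(MeteredFn.class, "line_lengths");

    @ProcessElement
    public void processElement(@Element String line, OutputReceiver<String> out) {
        elementsSeen.inc();
        lineLengths.update(line.length());
        out.output(line);
    }
}
```

A gauge is created analogously with Metrics.gauge, and Beam SQL queries are applied to schema-aware PCollections via the SqlTransform from the SQL extension module.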
When you are finished with this course, you will have the skills and knowledge to perform a wide range of data processing tasks using core Beam transforms, and you will be able to track metrics and run SQL queries on input streams.
Table of contents
- Demo: Creating and Executing a Beam Pipeline 7m
- Demo: Pipeline Specification Using MapElements and FlatMapElements 6m
- Demo: File Source and Files Sink 6m
- Demo: Custom Pipeline Options 7m
- Demo: Flink Runner and Spark Runner 4m
- Demo: Schema Specification and Inference 6m
- Demo: Reading Data with Schemas from Files 3m
- Transforms 1m
- Core Beam Transforms 7m
- Demo: ParDo and DoFn Filtering Operations 5m
- Demo: ParDo and DoFn Extracting and Formatting Operations 2m
- Demo: ParDo and DoFn Computation Operations 2m
- Demo: GroupByKey and Aggregations 5m
- Demo: CoGroupByKey for Joining Datasets 8m
- Demo: Combine 8m
- Demo: Flatten 3m
- Demo: Partition 3m
- Demo: Composite Transforms 4m
- User Transform Code Requirements 3m
- Stateless and Stateful Transformations 2m
- Types of Windows 5m
- Event Time, Ingestion Time, and Processing Time 4m
- Watermarks and Late Data 3m
- Demo: Fixed Windows 9m
- Demo: Sliding Windows 3m
- Demo: Session Windows 2m
- Demo: Global Windows 1m
- Demo: Side Inputs 5m
- Demo: Inner Join 5m
- Demo: Outer Joins 4m
- Demo: Join Using Side Inputs 2m
- Demo: Performing Joins Using CoGroupByKey - 1 7m
- Demo: Performing Joins Using CoGroupByKey - 2 3m
- Apache Flink and Apache Spark 2: Compatibility with Apache Beam 5m