HDInsight Deep Dive: Storm, HBase, and Hive
HDInsight is Microsoft's managed Big Data stack in the cloud. On Azure, you can provision clusters running Storm, HBase, and Hive that process thousands of events per second, store petabytes of data, and give you a SQL-like interface to query it all. In this course, we'll build a full solution on the stack and take a deep dive into each of the technologies.
What you'll learn
Storm is a distributed compute platform that you can plug into Azure Event Hubs to power event stream processing. You can scale Storm to read tens of thousands of events per second and build a reliable workflow in which every event is guaranteed to be processed. HBase is a NoSQL database that is easy to get started with and can store tables with billions of rows and millions of columns. It is designed for real-time data access, and it has a REST interface, so you can read and write HBase data from a .NET Storm app. Hive is a data warehouse that provides a SQL-like interface over Big Data, including HBase tables and other sources. With Hive you can join across multiple sources and run queries from PowerShell and .NET. In this course, we use all three technologies running on Microsoft Azure to build a race timing solution, and we dive into performance tuning, reliability, and administration.
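As a rough illustration of what writing through the HBase REST interface involves, the sketch below builds the JSON body for a single-row write. It relies on the fact that the REST server's JSON format carries row keys, column names, and values as base64 strings. The row key `race1|rider42`, the column `t:split1`, and the helper names are hypothetical and not taken from the course, and the sketch stops short of sending the request. It is written in Python for brevity; the course itself works in .NET.

```python
import base64
import json


def b64(text: str) -> str:
    """Base64-encode a string, as the HBase REST JSON format expects."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def build_cell_set(row_key: str, cells: dict) -> dict:
    """Build a one-row cell set for an HBase REST write.

    `cells` maps "family:qualifier" column names to string values.
    The row key and column names used below are hypothetical examples.
    """
    return {
        "Row": [
            {
                "key": b64(row_key),
                "Cell": [
                    {"column": b64(column), "$": b64(value)}
                    for column, value in cells.items()
                ],
            }
        ]
    }


# Assumed example data: one split time for one rider in one race.
payload = build_cell_set("race1|rider42", {"t:split1": "00:12:31"})

# The serialized body would be sent with Content-Type: application/json.
body = json.dumps(payload)
```

A client would typically send `body` in a PUT or POST to the table's row resource on the REST endpoint; the exact URL and authentication depend on how the cluster is set up.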
Table of contents
- Module Introduction 1m
- HBase Cluster Nodes 4m
- HFiles and Regions 4m
- HBase Data Structure in Azure 4m
- Meta Tables and Region Splits 3m
- Splitting and Pre-splitting Regions 4m
- .NET and HBase Best Practices 3m
- Integration Testing with Docker 3m
- Load Balancing Stargate 4m
- Performance Analysis 4m
- Scaling and Compaction 5m
- Module Summary 4m
- Module Introduction 2m
- Storm Application Architecture 3m
- Processing Race Timing with Storm 3m
- Saving Events and Defining Bolt Schemas 4m
- Local Memory Caches in Bolts 4m
- Buffering Writes in Bolts 3m
- Flushing Buffers with the Tick Stream 3m
- Designing the Topology 4m
- Building the Topology 5m
- Deploying to HDInsight 3m
- Running Race Simulations 2m
- Module Summary 2m
- Module Introduction 2m
- Storm Cluster Architecture 3m
- Runtime Compute Components 4m
- Approaches to Performance Testing 4m
- Performance Tuning Storm 6m
- Scaling the Storm Cluster 3m
- Guaranteed Messaging 3m
- Implementing Tuple Trees 5m
- Logging and Monitoring 3m
- Custom Component Logging 3m
- Unit & Integration Testing 4m
- Module Summary 3m
- Module Introduction 2m
- HiveQL, the Hive Query Language 3m
- Mapping HBase Tables in Hive 2m
- Hive Data Types and HBase Column Families 4m
- Querying Race Results 4m
- Mapping Flat Files in Hive 4m
- Joining HBase Tables and CSV Files in HiveQL 3m
- Writing Data to Azure from Hive 3m
- Recalculating Race Results 3m
- Hive Views and Functions 3m
- Collections, Joins, and Ranking 3m
- Hive and Big Data 3m
- Module Summary 2m
- Module Introduction 2m
- Hive and YARN 3m
- Parallelism for Hive Queries 3m
- HiveQL Execution Plans 3m
- Filtering HBase Tables 3m
- Parameterising Hive Queries with PowerShell 3m
- Running Parallel Hive Queries with PowerShell 4m
- The Hive ODBC Connector 4m
- Connecting to Hive from .NET Apps 4m
- Writing Hive UDFs in C# 4m
- Customizing the HBase Cluster for Hive 4m
- Course Summary 2m