SQL on Hadoop - Analyzing Big Data with Hive
This course will teach you the Hive query language and how to apply it to solve common Big Data problems. It includes an introduction to distributed computing, Hadoop, and MapReduce fundamentals, and covers the latest features released with Hive 0.11.
What you'll learn
Aimed at everyone from developers to analysts, this Hive SQL course tackles a few big questions about big data:
- Why does this technology exist and why do I need it?
- How can I get the best out of it utilizing something familiar like SQL?
- How does this all fit together in an ever-evolving ecosystem?
The course presents some of the challenges you might face when solving real production problems and shows how Apache Hive makes them easier to tackle.
Table of contents
- Introduction 1m
- Hive Motivation 2m
- Hive Architecture 2m
- Hive Principles - Schema on Read 1m
- Hive Principles - The Hive Warehouse 2m
- Hive Query Language Basics - SELECT and Sub Queries 4m
- Creating Databases and Tables with HiveQL 8m
- Demo: Working with Hive Tables and Loading Data into Warehouse 12m
- Loading Data - Hive Managed and External Tables 2m
- Demo: External Tables and Create Table Alternatives 10m
- Summary 1m
- Introduction 1m
- Data Types 8m
- Type Conversions 1m
- Managed Partitioned Tables 7m
- External Partitioned Tables 4m
- Demo: Table Partitioning 19m
- Multi Inserts and Dynamic Partition Inserts 14m
- Demo: Loading Data Use Case 6m
- Data Retrieval - Group By and Functions 13m
- Sorting and Controlling Data Flow 8m
- The CLI and Variable Substitution 7m
- Summary 1m
- Introduction 1m
- Bucketing 4m
- Bucket and Block Sampling 4m
- Joins 4m
- Joins in Depth and Join Optimizations 6m
- Map-side Joins for Bucketed Tables 2m
- Distributed Cache 3m
- UDTFs, Explode and Lateral View 6m
- Demo: Extending Hive - Creating Your own UDF 7m
- Demo: Extending Hive - Compiling and Testing Custom UDF 5m
- Extending Hive - Custom UDF Recap 3m
- Demo: Hive Initialization File 1m
- Accessing The Distributed Cache 1m
- Hadoop Streaming and Transform() 5m
- Windowing and Analytics Functions 3m
- Demo: Putting it All Together Using Transform 13m
- Demo: Analytics Functions 4m
- Demo: Ranking Functions 5m
- Summary 1m
Course FAQ
Hadoop is a software framework for storing and processing large data sets across clusters of hardware. It provides large-scale storage for all kinds of data and distributes processing across the cluster, so it can run many tasks in parallel.
Hive is a data warehouse software project built on top of Hadoop that provides data query and analysis. SQL is a query language for working with data in relational databases. Both are used to query data, but Hive lets you apply a SQL-like language, HiveQL, to large data sets stored in Hadoop.
This course will introduce you to Hadoop and the Hive query language. Some of the topics covered include:
- The concepts of distributed computing
- What MapReduce is
- Creating databases and tables with HiveQL
- Multi inserts and dynamic partition inserts
- Bucket and block sampling
- Storage and the ecosystem
- Much more
This course is great for anyone who wants to learn Hadoop, Hive, and the Hive query language (HiveQL). If you want to be able to solve common Big Data problems, then this is perfect for you.
This is an intermediate-level course, so it assumes some prior knowledge of working with Big Data and query languages like SQL. However, no prior knowledge of Hadoop or Hive is expected.