Writing Complex Analytical Queries with Hive
Hive is a data warehouse that runs on top of the Hadoop distributed computing framework. It works on huge datasets, so this course is useful for understanding its features so you can write efficient, fast, and optimal queries.
What you'll learn
The Hive data warehouse supports analytical processing, it generally processes long-running jobs which crunch a huge amount of data. By understanding what goes on behind the scenes in Hive, you can structure your Hive queries to be optimal and performant, thus making your data analysis very efficient. In this course, Writing Complex Analytical Queries with Hive, you'll discover how to make design decisions and how to lay out data in your Hive tables. First, you'll dive into partitioning and bucketing, which are ways to reduce the data a query has to process. You'll cover how and when you use partitioning, bucketing, or both when you set up your tables. Next, you'll be introduced to the joins operation, along with covering how to deal with large tables, and run and optimize map-only joins. Lastly, you'll learn windowing functions, which allow you to write complex queries simply and easily with no intermediate tables. An important optimization with large datasets. By the end of this course, you'll develop an understanding for the little details that makes writing complex queries easier and faster.
Table of contents
- Version Check 0m
- Introduction and Prerequisites for This Course 2m
- A Data Warehouse for Analytical Processing 4m
- Hive as a Data Warehouse 3m
- Managing Huge Datasets and Writing Faster Queries 3m
- A Brief Introduction: Bucketing and Partitioning 4m
- A Brief Introduction: Join Optimizations 3m
- A Brief Introduction: Window Functions 3m
- Partitioning: The Logical Equivalent of Indexes 4m
- Data Organization with Partitions 5m
- Working with a Managed Partitioned Table 6m
- When Would You Use Partitions? 2m
- Loading from Files into a Partitioned Table 3m
- Partitioning an External Table 7m
- Partitioning Trade-offs 3m
- Introduction to Dynamic Partitioning 4m
- Implementing Dynamic Partitioning 5m
- Multi-column Partitioning 3m
- Bucketing: The Logical Equivalent of Hash Tables 5m
- The Modulo Operator as a Hashing Function 5m
- Working with Bucketed Tables 3m
- Bucketing vs. Partitioning 3m
- Implementing a Partitioned, Bucketed Table 3m
- Advantages of Bucketing 7m
- Sorting Records Within a Bucket 3m
- Sampling Data from a Hive Table 5m
- Bucket Sampling on Hive Tables 5m
- Behind the Scenes: An Introduction to MapReduce 4m
- Optimizing Joins: Join Columns and MapReduce Jobs 2m
- Implementing a Join Operation 4m
- Optimizing Joins: Streaming the Largest Table 3m
- Optimizing Joins: Bucketing and Partitioning on the Join Columns 2m
- The Left Semi-join Operator 6m
- Behind the Scenes: The MapReduce Data Flow 4m
- Behind the Scenes: MapReduce for Join Operations 4m
- Map-only Joins: The Inner Join 5m
- Map-only Joins: The Left Outer Join 3m
- Map-only Joins: The Right Outer Join 2m
- Map-only Joins: The Full Outer Join 3m
- The Bucket Map Join 5m
Course FAQ
Hive is a data warehouse, which works on huge datasets, which means any query that you run on Hive is likely to be slow and long running without the tips and tricks in this course.
This course helps you make design decisions on how to layout data in your Hive tables, partitioning and bucketing are ways to reduce the data your query has to process, understand how and when you would use partitioning, bucketing or both.
This course assumes that you have some familiarity with Hive and writing queries for it.
You should have Hive v2 which runs on top of Hadoop 2, and have the Beeline command interface to connect to Hive locally.