Data Transformations with Apache Pig
Pig is an open-source engine for executing parallelized data transformations on Hadoop. This course shows you how Pig can help you work with incomplete data that has an inconsistent schema, or perhaps no schema at all.
What you'll learn
Pig is open-source software that is part of the Hadoop ecosystem of technologies. Pig excels at working with data that lies beyond traditional data warehouses: it deals well with data that is missing, incomplete, or inconsistent, and with data that has no schema at all. In this course, Data Transformations with Apache Pig, you'll learn how to transform data with Pig. First, you'll start with the very basics: installing Pig and getting started with the Grunt shell. Next, you'll discover how to load data into relations and store transformed results to files using the load and store commands. Then, you'll work on a real-world dataset, analyzing accidents in NYC using collision data published by the City of New York. Finally, you'll explore advanced constructs such as the nested foreach, get a brief glimpse into the world of MapReduce, and see how easy it is to implement that paradigm in Pig. By the end of this course, you'll have a better understanding of data transformations with Apache Pig.
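As a taste of the load-and-store workflow described above, here is a minimal Pig Latin sketch. The file name, schema, and output path are hypothetical placeholders, not taken from the course materials:

```pig
-- Load a comma-delimited file into a relation, declaring a schema.
-- 'nyc_collisions.csv' and its fields are illustrative assumptions.
collisions = LOAD 'nyc_collisions.csv'
    USING PigStorage(',')
    AS (borough:chararray, injured:int, killed:int);

-- Transform: keep only rows where at least one person was injured.
serious = FILTER collisions BY injured > 0;

-- Store the transformed relation into an output directory.
STORE serious INTO 'output/serious_collisions' USING PigStorage(',');
```

Running this in the Grunt shell (or as a script with `pig -f`) reads the input, applies the filter in parallel on the cluster, and writes part files under the output directory.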
Table of contents
- The Structure of a Pig Script and the Concept of Relations 5m
- Loading Data from Files and Directories 4m
- Loading Data with Schema 3m
- Storing Relations in Directories 3m
- Case-sensitivity in Pig 1m
- Scalar Data Types 3m
- Complex Data Types: The Tuple 9m
- Complex Data Types: The Bag 5m
- Complex Data Types: The Map 6m
- Working with Partial Schema Specification 5m
- Download NYC Collision Data 7m
- Visualize the Group by Operation 3m
- The Group by Operation 5m
- Aggregations on Grouped Data 5m
- Join Operations on Relations 5m
- Types of Joins 5m
- Implement the Left Outer, Self, and Cross Joins 4m
- The Union Operation 3m
- The Union Onschema Operation 7m
- The Flatten Function 5m
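The grouping, aggregation, and flatten modules listed above can be sketched in a few lines of Pig Latin. All relation and field names here are illustrative assumptions, not the course's actual code:

```pig
-- Hypothetical input matching the NYC collision theme of the course.
collisions = LOAD 'nyc_collisions.csv'
    USING PigStorage(',')
    AS (borough:chararray, injured:int, killed:int);

-- GROUP produces one tuple per borough: the group key plus a bag
-- holding every matching input tuple.
by_borough = GROUP collisions BY borough;

-- Aggregations run over each group's bag.
stats = FOREACH by_borough GENERATE
    group AS borough,
    COUNT(collisions) AS num_collisions,
    SUM(collisions.injured) AS total_injured;

-- FLATTEN un-nests a bag, turning its tuples back into
-- top-level rows of the result relation.
flat = FOREACH by_borough GENERATE FLATTEN(collisions);
```

The `stats` relation yields one row per borough with its collision count and injury total, while `flat` demonstrates how flatten reverses the nesting that grouping introduces.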