
Perform Data Cleaning Operations with PySpark
In this intermediate lab, you will apply your data exploration and preprocessing skills in Python to clean the New York City Airbnb Open Data using PySpark. The dataset contains missing values, null entries, and duplicate records. Through guided steps, you will impute missing data based on relevant factors and implement deduplication techniques to ensure data consistency and integrity.

Introduction
Hello, learners!
In this lab, you will apply your data exploration and preprocessing skills in Python to clean the New York City Airbnb Open Data 🏨 using PySpark.
Key Concepts
- PySpark Data Processing
  - Spark Session Initialization
  - Reading CSV Data
  - Defining Schema
- Data Aggregation & Transformation
  - GroupBy and Aggregation
  - Computing Mean, Median, and Mode
  - Joining DataFrames
  - Updating Column Values with when
- Feature Engineering
  - Handling Categorical and Numerical Data
  - Filtering Data
  - Replacing Zero Values
- Data Quality Checks
  - Counting Specific Rows
  - Identifying Duplicate Records
Key Steps
You will perform the following steps in this lab:
- Step 1: Load the dataset
- Step 2: Impute missing values for hotel prices and reviews
- Step 3: Impute missing values for hotel neighbourhoods
- Step 4: Update hotel availability
- Step 5: Drop duplicate bookings per host
Project Structure
The FILETREE consists of the following directories and files:
- dataset: Holds the data.csv file with six features (host_id, neighbourhood_group, neighbourhood, price, number_of_reviews, and availability_365).
- src: Holds five files (step1.py, step2.py, step3.py, step4.py, and step5.py) for you to write code for each task within each of the five steps.
Let's go!⚡
Challenge
Load the dataset
In this step, you will work on the src/step1.py file. The file already contains the necessary import statements and stores the link to the data.csv file in the path variable. You will complete four tasks in this step, from initializing the Spark session to defining the correct schema for the dataset before loading it into a PySpark DataFrame.
Let us proceed with implementing the tasks in src/step1.py!
> 📝 NOTE
> - Comment out all the lines that display content to the console after completing all the tasks.
> - You are provided with the load_data function that initializes a Spark session and loads the dataset with your custom schema. You will use this function in the following steps.
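For orientation, here is a minimal sketch of what such a load_data function might look like. It assumes the six columns listed under Project Structure, integer types for the numeric features, and an arbitrary application name; the lab's starter code and task instructions define the exact schema:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

def load_data(path):
    # Initialize (or reuse) a Spark session; the app name is arbitrary here
    spark = SparkSession.builder.appName("AirbnbDataCleaning").getOrCreate()

    # Explicit schema for the six features, instead of relying on inference
    schema = StructType([
        StructField("host_id", IntegerType(), True),
        StructField("neighbourhood_group", StringType(), True),
        StructField("neighbourhood", StringType(), True),
        StructField("price", IntegerType(), True),
        StructField("number_of_reviews", IntegerType(), True),
        StructField("availability_365", IntegerType(), True),
    ])

    # Read the CSV with a header row, applying the custom schema
    return spark.read.csv(path, header=True, schema=schema)
```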
Challenge
Impute missing values for hotel prices and reviews
Now that you have loaded your dataset with a custom schema, let us dive into the data cleaning operations. In this step, you will impute missing values for hotel prices and reviews.
Navigate to the src/step2.py file to handle null values in the numerical columns: price and number_of_reviews. The file already has the necessary import statements for all the tasks within this step and stores the DataFrame in the variable df.
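For a rough idea of the imputation pattern, here is a minimal sketch that assumes price is filled with the mean and number_of_reviews with the median; the lab tasks may prescribe different statistics, and df is the DataFrame provided by the starter code:

```python
from pyspark.sql import functions as F

# Mean of the non-null prices; first() returns a Row, [0] extracts the value
mean_price = df.select(F.mean("price")).first()[0]

# Approximate median of number_of_reviews (relativeError=0.0 makes it exact)
median_reviews = df.approxQuantile("number_of_reviews", [0.5], 0.0)[0]

# fillna on integer columns expects integer values, so cast the statistics
df = df.fillna({
    "price": int(round(mean_price)),
    "number_of_reviews": int(median_reviews),
})
```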
Challenge
Impute missing values for hotel neighbourhoods
So far, you have handled NULL values for two integer columns. Now, let's focus on filling in the NULL values for the string column: neighbourhood.
The main goal of this step is to fill NULL values in the neighbourhood column by determining the most frequently occurring value within each neighbourhood group.
You will work in the src/step3.py file, which already contains the necessary import statements and the DataFrame.
ℹ️ INFO
Just like filling missing values in numerical columns, you can also handle categorical columns using various methods beyond mode, such as forward/backward fill, nearest neighbor imputation, and predictive techniques.
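To make the group-wise mode concrete, here is one possible sketch; the helper column mode_neighbourhood and the window-based tie-breaking are illustrative choices rather than the lab's exact task breakdown:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Count each (group, neighbourhood) pair among non-null rows
counts = (
    df.filter(F.col("neighbourhood").isNotNull())
      .groupBy("neighbourhood_group", "neighbourhood")
      .count()
)

# Keep the most frequent neighbourhood per group as that group's mode
w = Window.partitionBy("neighbourhood_group").orderBy(F.desc("count"))
modes = (
    counts.withColumn("rank", F.row_number().over(w))
          .filter(F.col("rank") == 1)
          .select("neighbourhood_group",
                  F.col("neighbourhood").alias("mode_neighbourhood"))
)

# Join the per-group modes back, then fill NULLs with when/otherwise
df = (
    df.join(modes, on="neighbourhood_group", how="left")
      .withColumn(
          "neighbourhood",
          F.when(F.col("neighbourhood").isNull(), F.col("mode_neighbourhood"))
           .otherwise(F.col("neighbourhood")),
      )
      .drop("mode_neighbourhood")
)
```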
Challenge
Update hotel availability
In the previous data cleaning steps, you learned how to handle missing values in numerical and categorical columns. But what if the missing values in a column are not NULL but instead contain random or unwanted data? In this step, you will handle a similar scenario, replacing a hotel availability of 0 days in a year with 90 days.
In the data.csv file, you have a column named availability_365 where missing values are represented by the value 0. You will work in the src/step4.py file to replace all 0 values with 90.
The step4.py file already has the necessary imports and the DataFrame.
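The replacement itself typically comes down to a conditional column update with when; a minimal sketch, assuming df is the provided DataFrame:

```python
from pyspark.sql import functions as F

# Treat 0 as a missing-value marker and replace it with 90
df = df.withColumn(
    "availability_365",
    F.when(F.col("availability_365") == 0, 90)
     .otherwise(F.col("availability_365")),
)

# Sanity check: no rows with 0 availability should remain
assert df.filter(F.col("availability_365") == 0).count() == 0
```

PySpark's df.replace(0, 90, subset=["availability_365"]) would achieve the same result in a single call.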
Challenge
Drop duplicate bookings per host
In your final data cleaning step, you will learn how to drop rows where a feature's value appears more than once.
For your dataset, you will remove records where hosts have booked a hotel more than once. In other words, retain only one record per host_id.
You will be working in the src/step5.py file, which has all the necessary imports and the DataFrame.
In this lab, you have learned various techniques to perform data cleaning in PySpark, from handling missing values in numerical and categorical columns to replacing random data and dropping duplicate records. You can now utilize this dataset for exploratory data analysis, data visualization, or building machine learning/deep learning models.
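For reference, the deduplication at the heart of this final step can be sketched in a single call, assuming one record per host_id is all that is required:

```python
# Keep one record per host_id; which duplicate survives is not guaranteed
df = df.dropDuplicates(["host_id"])

# Sanity check: the row count should equal the number of distinct hosts
assert df.count() == df.select("host_id").distinct().count()
```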
Good luck!🏆