
Perform Data Cleaning Operations with PySpark
In this intermediate lab, you will apply your data exploration and preprocessing skills in Python to clean the New York City Airbnb Open Data using PySpark. The dataset contains missing values, null entries, and duplicate records. Through guided steps, you will impute missing data based on relevant factors and implement deduplication techniques to ensure data consistency and integrity.

Introduction
Hello, learners!
In this lab, you will apply your data exploration and preprocessing skills in Python to clean the New York City Airbnb Open Data 🏨 using PySpark.
Key Concepts
- PySpark Data Processing
  - Spark Session Initialization
  - Reading CSV Data
  - Defining Schema
- Data Aggregation & Transformation
  - GroupBy and Aggregation
  - Computing Mean, Median, and Mode
  - Joining DataFrames
  - Updating Column Values with when
- Feature Engineering
  - Handling Categorical and Numerical Data
  - Filtering Data
  - Replacing Zero Values
- Data Quality Checks
  - Counting Specific Rows
  - Identifying Duplicate Records
Key Steps
You will perform the following steps in this lab:
- Step 1: Load the dataset
- Step 2: Impute missing values for hotel prices and reviews
- Step 3: Impute missing values for hotel neighbourhoods
- Step 4: Update hotel availability
- Step 5: Drop duplicate bookings per host
Project Structure
The FILETREE consists of the following directories and files:
- dataset: Holds the data.csv file with six features (host_id, neighbourhood_group, neighbourhood, price, number_of_reviews, and availability_365).
- src: Holds five files (step1.py, step2.py, step3.py, step4.py, and step5.py) for you to write code for each task within each of the five steps.
Let's go!⚡
Challenge
Load the dataset
In this step, you will work on the src/step1.py file. The file already contains the necessary import statements and stores the link to the data.csv file in the path variable. You will complete four tasks in this step, from initializing the Spark session to defining the correct schema for the dataset before loading it into a PySpark DataFrame.
Let us proceed with implementing the tasks in src/step1.py!
> 📝 NOTE
> - Comment out all the lines that display content to the console after completing all the tasks.
> - You are provided with the load_data function that initializes a Spark session and loads the dataset with your custom schema. You will use this function in the following steps.
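For orientation, here is a minimal sketch of what such a load_data function might look like. It assumes the six columns listed under Project Structure, integer types for the numeric features, and an arbitrary application name; the lab's starter code and task instructions define the exact schema:

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

def load_data(path):
    # Initialize (or reuse) a Spark session; the app name is arbitrary here
    spark = SparkSession.builder.appName("AirbnbDataCleaning").getOrCreate()

    # Explicit schema for the six features, instead of relying on inference
    schema = StructType([
        StructField("host_id", IntegerType(), True),
        StructField("neighbourhood_group", StringType(), True),
        StructField("neighbourhood", StringType(), True),
        StructField("price", IntegerType(), True),
        StructField("number_of_reviews", IntegerType(), True),
        StructField("availability_365", IntegerType(), True),
    ])

    # Read the CSV with a header row, applying the custom schema
    return spark.read.csv(path, header=True, schema=schema)
```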
Challenge
Impute missing values for hotel prices and reviews
Now that you have loaded your dataset with a custom schema, let us dive into the data cleaning operations. In this step, you will impute missing values for hotel prices and reviews.
Navigate to the src/step2.py file to handle null values in the numerical columns: price and number_of_reviews. The file already has the necessary import statements for all the tasks within this step and stores the DataFrame in the variable df.
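For a rough idea of the imputation pattern, here is a minimal sketch that assumes price is filled with the mean and number_of_reviews with the median; the lab tasks may prescribe different statistics, and df is the DataFrame provided by the starter code:

```python
from pyspark.sql import functions as F

# Mean of the non-null prices; first() returns a Row, [0] extracts the value
mean_price = df.select(F.mean("price")).first()[0]

# Approximate median of number_of_reviews (relativeError=0.0 makes it exact)
median_reviews = df.approxQuantile("number_of_reviews", [0.5], 0.0)[0]

# fillna on integer columns expects integer values, so cast the statistics
df = df.fillna({
    "price": int(round(mean_price)),
    "number_of_reviews": int(median_reviews),
})
```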
Challenge
Impute missing values for hotel neighbourhoods
So far, you have handled NULL values for two integer columns. Now, let's focus on filling in the NULL values for the string column: neighbourhood.
The main goal of this step is to fill NULL values in the neighbourhood column by determining the most frequently occurring value within each neighbourhood group.
You will work in the src/step3.py file, which already contains the necessary import statements and the DataFrame.
ℹ️ INFO
Just like filling missing values in numerical columns, you can also handle categorical columns using various methods beyond mode, such as forward/backward fill, nearest neighbor imputation, and predictive techniques.
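To make the group-wise mode concrete, here is one possible sketch; the helper column mode_neighbourhood and the window-based tie-breaking are illustrative choices rather than the lab's exact task breakdown:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Count each (group, neighbourhood) pair among non-null rows
counts = (
    df.filter(F.col("neighbourhood").isNotNull())
      .groupBy("neighbourhood_group", "neighbourhood")
      .count()
)

# Keep the most frequent neighbourhood per group as that group's mode
w = Window.partitionBy("neighbourhood_group").orderBy(F.desc("count"))
modes = (
    counts.withColumn("rank", F.row_number().over(w))
          .filter(F.col("rank") == 1)
          .select("neighbourhood_group",
                  F.col("neighbourhood").alias("mode_neighbourhood"))
)

# Join the per-group modes back, then fill NULLs with when/otherwise
df = (
    df.join(modes, on="neighbourhood_group", how="left")
      .withColumn(
          "neighbourhood",
          F.when(F.col("neighbourhood").isNull(), F.col("mode_neighbourhood"))
           .otherwise(F.col("neighbourhood")),
      )
      .drop("mode_neighbourhood")
)
```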
Challenge
Update hotel availability
In the previous data cleaning steps, you learned how to handle missing values in numerical and categorical columns. But what if the missing values in a column are not NULL but instead contain random or unwanted data? In this step, you will handle a similar scenario, replacing a hotel availability of 0 days in a year with 90 days.
In the data.csv file, you have a column named availability_365 where missing values are represented by the value 0. You will work in the src/step4.py file to replace all 0 values with 90.
The step4.py file already has the necessary imports and the DataFrame.
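The replacement itself typically comes down to a conditional column update with when; a minimal sketch, assuming df is the provided DataFrame:

```python
from pyspark.sql import functions as F

# Treat 0 as a missing-value marker and replace it with 90
df = df.withColumn(
    "availability_365",
    F.when(F.col("availability_365") == 0, 90)
     .otherwise(F.col("availability_365")),
)

# Sanity check: no rows with 0 availability should remain
assert df.filter(F.col("availability_365") == 0).count() == 0
```

PySpark's df.replace(0, 90, subset=["availability_365"]) would achieve the same result in a single call.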
Challenge
Drop duplicate bookings per host
In your final data cleaning step, you will learn how to drop rows where a feature's value appears more than once.
For your dataset, you will remove records where hosts have booked a hotel more than once. In other words, retain only one record per host_id.
You will be working in the src/step5.py file, which has all the necessary imports and the DataFrame.
In this lab, you have learned various techniques to perform data cleaning in PySpark, from handling missing values in numerical and categorical columns to replacing random data and dropping duplicate records. You can now utilize this dataset for exploratory data analysis, data visualization, or building machine learning/deep learning models.
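For reference, the deduplication at the heart of this final step can be sketched in a single call, assuming one record per host_id is all that is required:

```python
# Keep one record per host_id; which duplicate survives is not guaranteed
df = df.dropDuplicates(["host_id"])

# Sanity check: the row count should equal the number of distinct hosts
assert df.count() == df.select("host_id").distinct().count()
```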
Good luck!🏆