
Build a Recommendation System with PySpark

In this Code Lab, you'll learn to build powerful recommendation systems using PySpark by implementing both collaborative filtering with the Alternating Least Squares (ALS) algorithm and content-based filtering with techniques like TF-IDF and Word2Vec. You’ll optimize model performance through hyperparameter tuning, address cold start problems, and enhance content-based models with dimensionality reduction. By the end, you'll gain the skills to develop personalized, scalable recommendation engines and confidently apply them to real-world scenarios such as e-commerce or media streaming.


Path Info

Level: Intermediate
Duration: 45m
Published: Apr 09, 2025


Table of Contents

  1. Challenge

    Introduction to Collaborative and Content-Based Recommendation Systems

    Goal: In this step you will understand the foundational concepts of both collaborative filtering and content-based recommendation systems, and set up the environment to implement them using PySpark.

    What Is a Recommendation Engine?

    A recommendation engine is a system that predicts user preferences and suggests relevant items based on past interactions or content attributes. These engines power product recommendations on platforms like Netflix, Amazon, and Spotify, helping businesses improve customer engagement and retention. Examples include:

    • Netflix recommending movies based on watch history.
    • Amazon suggesting products based on previous purchases.

    There are two main types:

    1. Collaborative Filtering: Uses user-item interactions (e.g., purchases, ratings).
    2. Content-Based Filtering: Uses item characteristics (e.g., product category, description).

    Learning Objectives of This Lab

    By the end of this lab, you will be able to:

    1. Implement a collaborative filtering model using ALS (Alternating Least Squares) in PySpark.
    2. Build a content-based recommendation system using TF-IDF, Word2Vec, and cosine similarity.
    3. Improve model performance using hyperparameter tuning, cold-start strategies, and dimensionality reduction.

    💡 Code Explanation: Here's a breakdown of the solution code:

    1. Import libraries: SparkSession is needed to create a Spark environment.
    2. Create a Spark session: SparkSession.builder.appName("RecommendationSystemLab").getOrCreate() initializes PySpark, as in the sketch below.
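    A minimal sketch of this setup, using the app name from the explanation above (your solution code may differ slightly):

      from pyspark.sql import SparkSession

      # Create (or reuse) a Spark session for the lab
      spark = SparkSession.builder.appName("RecommendationSystemLab").getOrCreate()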

    📌 Why this task matters? PySpark enables scalable data processing for recommendation models.

    📝 Notes:

    • PySpark requires a SparkSession to process data.
    • The necessary libraries are already installed (pyspark).

    💡 Code Explanation:

    • You specified the file path and loaded the dataset using spark.read.csv.
    • The inferSchema=True parameter lets Spark automatically detect column data types.
    • In the main execution block (if __name__ == "__main__":), you call .printSchema() to check the structure and .show(5) to display sample rows, as in the sketch below.
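    A hedged sketch of this loading step (the file path below is a placeholder; use the dataset path provided in your lab environment):

      # Load the transactions dataset. header=True keeps the column names and
      # inferSchema=True lets Spark detect the column data types automatically.
      file_path = "data/retail_transactions.csv"  # placeholder path
      dataframe = spark.read.csv(file_path, header=True, inferSchema=True)

      if __name__ == "__main__":
          dataframe.printSchema()  # check the structure
          dataframe.show(5)        # display five sample rows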

    📌 Why this task matters? Understanding the dataset structure helps us determine what preprocessing is required.

    📝 Notes:

    The dataset contains transactions and explicit ratings, so we'll use the rating column as the explicit feedback. Key columns for recommendations:

    • customer_id (user)
    • product_name (item)
    • category (item metadata)
    • quantity (implicit rating)
    • rating (explicit feedback)

    📌 Why this task matters? Collaborative filtering requires structured user-item interaction data.

    📝 Notes:

    • You will use customer_id (user), product_name (item), and rating (explicit feedback).

    💡 Code Explanation:

    • Extract the necessary columns (customer_id, product_name, and rating).
    • Use StringIndexer to convert the categorical values into numerical indices for modeling, as in the sketch below.
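    A minimal sketch of this indexing step, assuming the column names above; the output column names user_index and item_index are illustrative, not the lab's exact solution:

      from pyspark.ml.feature import StringIndexer

      # Keep only the columns needed for collaborative filtering
      ratings_df = dataframe.select("customer_id", "product_name", "rating")

      # Convert the categorical user and item values into numerical indices
      user_indexer = StringIndexer(inputCol="customer_id", outputCol="user_index")
      item_indexer = StringIndexer(inputCol="product_name", outputCol="item_index")
      ratings_df = user_indexer.fit(ratings_df).transform(ratings_df)
      ratings_df = item_indexer.fit(ratings_df).transform(ratings_df)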

    📌 Why this task matters? Text-based content features need to be vectorized for similarity calculations.

    💡 Code Explanation:

      from pyspark.sql.functions import col, concat, lit

      # Build one row per product with a combined text description
      dataframe = (
          dataframe
          .select(
              col("product_name"),
              col("category"),
              concat(
                  col("product_name"), lit(" "),
                  col("category")).alias("product_description"))
          .distinct()
      )
    
    • Select distinct products and build a new feature “product_description” by combining product_name and category. The idea here is to create a column that captures enough information about the products.
    The Tokenizer function
      from pyspark.ml.feature import Tokenizer, StopWordsRemover, HashingTF, IDF

      # Tokenization: split the category text into individual lowercase tokens
      tokenizer = Tokenizer(inputCol="category", outputCol="tokens")
      content_df = tokenizer.transform(dataframe)
    
    • Converts the input string to lowercase and splits it into individual words on whitespace.
    • Here it is used to break down the category column into individual words for analysis.
    The StopWordsRemover function
      # Removing Stop Words
      stopwords_remover = StopWordsRemover(inputCol="tokens", outputCol="filtered_tokens")
      content_df = stopwords_remover.transform(content_df)
    
    • Removes common words (stopwords) such as "and," "the," and "is" that do not add meaningful information.
    • Here it helps improve model accuracy by focusing only on relevant words.
    The HashingTF (Hashing Term Frequency) function
      # Convert Text to Numerical Features
      hashingTF = HashingTF(inputCol="filtered_tokens", outputCol="rawFeatures", numFeatures=10)
      featurizedData = hashingTF.transform(content_df)
    
    • Converts a list of words into a fixed-length vector representation based on term frequency.
    • It uses a hashing function to assign words to vector indices.
    The IDF (Inverse Document Frequency) function
      # Re-weight term frequencies: common terms get lower weights, rare terms higher
      idf = IDF(inputCol="rawFeatures", outputCol="features")
      idfModel = idf.fit(featurizedData)
      rescaledData = idfModel.transform(featurizedData)
    
    • Adjusts the importance of terms by reducing the weight of common words and increasing the weight of rare words.
    • It helps identify distinguishing terms in the category column.

    These transformations help convert textual product descriptions into numerical representations that will later be used to compute similarity between products. Now each product category has a vectorized representation capturing its unique characteristics.

  2. Challenge

    Implementing Collaborative Filtering with ALS

    Collaborative Filtering

    Collaborative filtering works on a simple but powerful principle: people who have shown similar interests in the past are likely to have similar preferences in the future.

    User-based Collaborative Filtering 📌

    This method finds users with similar purchase behaviors and recommends items they have liked.

    • Example: If Alice and Bob both buy coffee and pastries, and Bob buys a croissant but Alice hasn’t, Alice might receive a croissant recommendation.
    Item-based Collaborative Filtering 📌

    This method looks at item relationships. It recommends items that are frequently purchased or interacted with together.

    • Example: If most customers who buy bananas also buy peanut butter, then peanut butter should be recommended to a banana lover!

    In this step, you will:

    1. Split the data into training and testing sets.
    2. Train an ALS model using the training data.
    3. Generate predictions on the test data.
    4. Evaluate the model using Root Mean Squared Error (RMSE).
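    A hedged sketch of these four steps, assuming the indexed DataFrame (ratings_df with user_index and item_index columns) from the earlier data-preparation sketch; parameter values are illustrative:

      from pyspark.ml.recommendation import ALS
      from pyspark.ml.evaluation import RegressionEvaluator

      # 1. Split the data into training and testing sets
      train_df, test_df = ratings_df.randomSplit([0.8, 0.2], seed=42)

      # 2. Train an ALS model on the training data
      als = ALS(
          userCol="user_index",
          itemCol="item_index",
          ratingCol="rating",
          coldStartStrategy="drop",  # drop NaN predictions; see the cold start step
      )
      model = als.fit(train_df)

      # 3. Generate predictions on the test data
      predictions = model.transform(test_df)

      # 4. Evaluate the model using Root Mean Squared Error (RMSE)
      evaluator = RegressionEvaluator(
          metricName="rmse", labelCol="rating", predictionCol="prediction"
      )
      rmse = evaluator.evaluate(predictions)
      print(f"RMSE: {rmse:.4f}")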

    Why ALS?

    Collaborative filtering works great but can suffer from sparsity—most users haven’t rated or purchased every item. This is where ALS shines! 🌟

    ALS reduces a large and sparse user-item matrix into a lower-dimensional form, helping us identify hidden patterns in user behavior and make recommendations efficiently.

    How Does ALS Work?
    • ALS factorizes the massive user-item interaction matrix into two smaller matrices. The idea is that multiplying these two matrices approximates the original user-item matrix.
    • ALS doesn’t compute everything at once—it alternates between optimizing one matrix while keeping the other fixed. This makes it computationally efficient and suitable for big data environments like PySpark.

    Evaluating the Model

    You will assess the model's performance using Root Mean Square Error (RMSE), a common metric for measuring the accuracy of a recommendation model. It calculates the average difference between the actual ratings and the predicted ratings, penalizing larger errors more than smaller ones.
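    For reference, the standard RMSE formula, where r_i is the actual rating, r̂_i the predicted rating, and N the number of predictions:

    $$\text{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(r_i - \hat{r}_i\right)^2}$$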

    Lower RMSE = Better Predictions

    If RMSE = 0, the model would be perfect, meaning predictions match actual ratings exactly (which never happens in real-world scenarios). If RMSE is too high, the model's predictions are inaccurate.

    So What Does an RMSE of 1.4131 Mean?

    Since ratings are typically on a scale (e.g., 1 to 5 in many systems), an RMSE of 1.4131 suggests our predictions are off by about 1.4 rating points on average. This means if a customer gave a product a rating of 4, our model might predict something like 2.6 or 5.4, which is not ideal.

    An RMSE of 1.4131 means our model makes decent predictions but has room for improvement. By handling the cold start problem and fine-tuning ALS, we may achieve more accurate recommendations for users. 🚀

  3. Challenge

    Handling Cold Start Problems

    Handling Cold Start Problems

    One challenge in recommendation systems is the Cold Start Problem—how do we recommend products to new users or items with no history? There are 3 main types of cold start problems:

    • New User Problem: If a user is new, there’s no purchase history to base recommendations on. Solution? Use popular items as a starting point.
    • New Item Problem: If a product has just been added, there’s no user interaction data yet. Solution? Recommend based on item metadata (e.g., category, description).
    • Sparsity Problem: Even with many users and products, most interactions are sparse. Solution? Hybrid models that blend collaborative filtering with content-based recommendations.
    How PySpark Addresses Cold Start Issues

    PySpark’s ALS (Alternating Least Squares) algorithm includes a parameter called coldStartStrategy, which helps manage predictions for unseen users or items. The possible values for this parameter are:

    1. "nan" (default)
      • Here predictions are made for all users and items, even if some have not appeared in the training set.
      • This can result in NaN (Not a Number) values for new users or items with no historical interactions.
    2. "drop"
      • Here any predictions that would result in NaN values are removed from the output.
      • This is useful when evaluating the model, as it prevents errors caused by missing predictions.
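    For example, a minimal sketch of choosing the "drop" strategy when constructing the model (column names follow the earlier sketches):

      from pyspark.ml.recommendation import ALS

      # Rows whose predictions would be NaN (unseen users/items) are removed,
      # so the RMSE evaluation is not skewed by missing values.
      als = ALS(
          userCol="user_index",
          itemCol="item_index",
          ratingCol="rating",
          coldStartStrategy="drop",
      )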
  4. Challenge

    Hyperparameter Tuning with Cross-validation

    What Is Hyperparameter Optimization?

    Hyperparameters are settings that are not learned by the model but set before training begins. Hyperparameter optimization involves systematically adjusting model parameters to improve performance while balancing computational efficiency.

    In ALS, key hyperparameters include:

    • Rank (rank): The number of latent factors used in matrix factorization.
    • Regularization Parameter (regParam): Prevents overfitting by adding a penalty to large weights.
    • Max Iterations (maxIter): Number of times ALS updates the matrices to reduce error.
    Why Is Hyperparameter Tuning Essential?

    Without tuning, the model might:

    • ❌ Overfit (too complex; it memorizes the training data and performs poorly on new data)
    • ❌ Underfit (too simple; it fails to capture patterns)
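    A hedged sketch of tuning these hyperparameters with cross-validation (the grid values are illustrative, not the lab's exact choices; als, evaluator, and train_df follow the earlier ALS sketch):

      from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

      # Candidate values for rank, regParam, and maxIter
      param_grid = (
          ParamGridBuilder()
          .addGrid(als.rank, [5, 10, 20])
          .addGrid(als.regParam, [0.01, 0.1])
          .addGrid(als.maxIter, [10, 20])
          .build()
      )

      # 3-fold cross-validation; keep the model with the lowest RMSE
      cv = CrossValidator(
          estimator=als,
          estimatorParamMaps=param_grid,
          evaluator=evaluator,
          numFolds=3,
      )
      cv_model = cv.fit(train_df)
      best_model = cv_model.bestModel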
  5. Challenge

    Building a Content-based Recommendation System

    A content-based recommendation system suggests items to users based on the attributes of items they have previously interacted with. It assumes that if a user likes a particular item, they will also like items that share similar characteristics.

    How Does Content-based Filtering Work?

    Content-based filtering works in three stages:

    • Feature Extraction – converting item attributes (e.g., text descriptions, price, category) into numerical representations such as TF-IDF vectors.
    • Similarity Calculation – measuring how similar two items are, typically with cosine similarity between those feature vectors.
    • Recommendation Generation – suggesting the items most similar to what the user has interacted with before.

    For example, if a user frequently buys organic green tea, a content-based system might recommend organic herbal tea based on shared keywords in their descriptions.

    Why Use Content-based Filtering?

    Content-based filtering:

    • Works well for cold-start items (new products) that have no user interactions.
    • Provides personalized recommendations tailored to individual preferences.
    • Can be combined with collaborative filtering for a hybrid recommendation approach.
    How Does TF-IDF Capture Product Characteristics?

    TF-IDF (Term Frequency-Inverse Document Frequency) is a technique that assigns importance to words in a collection of documents. It works by:

    • Calculating Term Frequency (TF): measures how often a word appears in a document.
    • Applying Inverse Document Frequency (IDF): reduces the weight of common words (like "the", "a") and increases the weight of rare, meaningful words.

    For instance, in a dataset of product descriptions:

    • If the word "headphones" appears in almost all documents, it will be assigned a low IDF.
    • If the word "Bluetooth" appears in fewer documents, it will be assigned a high IDF, making it more important.

    By converting text into a weighted vector representation, TF-IDF allows us to compare product similarities based on their meaningful features.
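    As an illustration, here is a hedged sketch of comparing two products' TF-IDF vectors with cosine similarity. The helper function below is not from the lab's solution code; it simply applies the standard formula to the features column produced earlier (rescaledData):

      import numpy as np

      def cosine_similarity(v1, v2):
          """Cosine similarity between two Spark ML vectors."""
          a, b = v1.toArray(), v2.toArray()
          denom = np.linalg.norm(a) * np.linalg.norm(b)
          return float(np.dot(a, b) / denom) if denom else 0.0

      # Collect each product's TF-IDF vector into a lookup table
      rows = rescaledData.select("product_name", "features").collect()
      vectors = {row["product_name"]: row["features"] for row in rows}

      # Example comparison (product names are placeholders from the dataset)
      # print(cosine_similarity(vectors["BREAD"], vectors["CHEESE"]))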

    NOTE 🚀:

    In Step 1 - Task 4, you implemented TF-IDF when you prepared the dataset for content-based filtering.

  6. Challenge

    Enhancing Content-based Recommendations with Word2Vec

    Using Word2Vec for Dense Product Embeddings

    While TF-IDF represents text as sparse vectors based on word frequency, Word2Vec creates dense embeddings by learning word relationships from large text corpora. Word2Vec works using one of two training strategies:

    • Continuous Bag of Words (CBOW): predicting a word based on its surrounding words.
    • Skip-gram Model: predicting surrounding words given a central word.

    This approach learns semantic relationships between words, enabling you to capture deeper product similarities.

    For example, after training on product descriptions:

    • The word "phones" might be close to "phone book" in the vector space.
    • "Laptop" and "Notebook" might also be close.
    When to Use Which?
    • TF-IDF works well for explicit keyword matching.
    • Word2Vec is better when we need semantic understanding.

    Combining both methods in hybrid models can improve recommendation accuracy. 🚀

  7. Challenge

    Improving Model Performance with Dimensionality Reduction

    What Is Dimensionality Reduction?

    Dimensionality reduction is a technique used to reduce the number of features in a dataset while preserving important patterns. This helps:

    • Improve computational efficiency.
    • Reduce storage and memory requirements.
    • Minimize the risk of overfitting by removing redundant information.
    • Speed up similarity calculations.

    When we generate TF-IDF features, we often create high-dimensional vectors. High-dimensional spaces slow down similarity calculations and increase computational costs. To overcome this, we use Principal Component Analysis (PCA). PCA finds the most important axes (principal components) in the data and projects feature vectors onto these lower-dimensional spaces.
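    A minimal sketch of this reduction with PySpark MLlib's PCA, applied to the TF-IDF features column built earlier (k=3 mirrors the 3-dimensional result discussed below and is illustrative):

      from pyspark.ml.feature import PCA

      # Project the TF-IDF vectors onto 3 principal components
      pca = PCA(k=3, inputCol="features", outputCol="pca_features")
      pca_model = pca.fit(rescaledData)
      reduced_df = pca_model.transform(rescaledData)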

    Why PCA?
    • PCA reduces feature space while preserving the most important variance in the data.
    • It works well with dense feature vectors, which are produced by TF-IDF.
    • PCA is easy to implement in PySpark MLlib and is computationally efficient.
    Why Not Singular Value Decomposition (SVD)?

    While SVD is another popular dimensionality reduction method, we don’t use it here because:

    • SVD works best with sparse matrices (like those found in collaborative filtering). TF-IDF results in dense vectors, making PCA more appropriate.
    • SVD requires matrix factorization, which can be computationally expensive on large-scale datasets in a distributed PySpark environment.
    • PySpark does not have built-in distributed SVD support for dense matrices, whereas PCA is directly available in PySpark MLlib.

    Comparison of Similarity Index Before and After PCA

    | Metric | Before PCA | After PCA |
    | -------- | -------- | -------- |
    | Similarity Index (BREAD vs CHEESE) | 0.597 | 0.87 |
    Key Observations
    1. Increased Similarity After PCA
      • The similarity index increased from 0.597 to 0.87 after applying PCA.
      • This suggests that PCA captured more relevant features, improving how similar products are represented.
    2. Effect of Dimensionality Reduction
      • Before PCA, the vectors had 5 dimensions, which might have included noise or redundant information.
      • After PCA, the vectors were compressed to 3 dimensions, retaining only the most important features related to product similarity.
    3. Better Feature Representation
      • PCA helps by removing less significant variations, making the feature space more compact and interpretable.
      • The increase in similarity suggests that BREAD and CHEESE share core characteristics, which PCA highlights more effectively.
  8. Challenge

    Final Evaluation and Comparison of Models

    Comparing Content-Based and Collaborative Filtering

    Now that you've implemented both content-based and collaborative filtering approaches, compare them! ⚖️

    | Aspect | Collaborative Filtering | Content-Based Filtering |
    | -------------- | ------------- | ------------ |
    | How it works | Uses user-item interactions | Uses item attributes |
    | Strengths | Learns user preferences well | Works even with little user data |
    | Challenges | Cold start problem for new users/items | Requires well-structured metadata |
    | Best For | Personalized recommendations | Finding similar items |

    🚀 When to Use Which?

    Use Collaborative Filtering when you have rich user interaction data. Use Content-Based Filtering when you have detailed item metadata and want to recommend similar items.

    🚀 Lab Summary:

    In this lab, you explored the foundations, implementation, and optimization of recommendation systems using collaborative filtering and content-based filtering in PySpark.

    1. You began by understanding the core principles behind these recommendation techniques and prepared datasets for both approaches. With collaborative filtering, you leveraged the Alternating Least Squares (ALS) algorithm to build a model that predicts user preferences. You then refined the model through hyperparameter tuning and tackled the cold start problem, ensuring better recommendations for new users and products.

    2. Shifting gears, you built a content-based recommendation system, transforming product descriptions into numerical features using TF-IDF and computing item similarity with cosine similarity.

    3. To boost performance, you applied the dimensionality reduction technique PCA, reducing computational complexity while maintaining recommendation quality. Finally, you evaluated and compared both collaborative and content-based models, weighing their strengths and weaknesses to determine the most effective approach.

Bismark is a BI & Big Data Engineer who applies his background in computer engineering and mathematics to Data Science, Artificial Intelligence, Machine Learning, Big Data, and Human-Computer Interaction, researching novel methods and algorithms that support disease cures, better healthcare and technology, autonomous systems, education, and productivity.
