How to Build a Netflix Style Recommendation Engine with Amazon SageMaker
Machine learning is at the heart of Netflix's recommendation engine. We wanted to see how it worked, so we built one using Amazon SageMaker.
Jun 08, 2023 • 18 Minute Read
Have you ever wondered how Netflix recommends movies to you? As a techie, I’ve always been curious. I knew machine learning was behind it, and I wondered if the same method could recommend courses on A Cloud Guru. I was so interested in this idea that I posed it as a #CloudGuruChallenge. It was a fun challenge and a great way to explore machine learning. For a solid understanding of the services and platforms available on AWS for machine learning projects, there is the Amazon Machine Learning Course. Read on to see my recommendation engine code.
A Matter of Taste: Machine Learning on Amazon Web Services (AWS)
Netflix makes recommendations based on your viewing history, what other members with similar tastes watch, and metadata about the movies, like genre, categories, and more. These same principles could easily be applied to make course recommendations. We track viewing history and metadata about our courses. The only missing component was an easy way to compare students to determine if they have similar tastes. Luckily, clustering, a machine learning technique, can be used to place each student in a group of students with similar tastes.
To begin this learning adventure, I teamed up with my colleague Julie Elkins to solve this challenge using machine learning and Amazon SageMaker. During this adventure, we learned a lot and produced a proof of concept (POC) for an engine that recommends courses to students using machine learning! Our POC recommends titles to you based on what you've watched and mastered along with what others in your assigned learning cluster have watched and mastered. It’s really cool!
The Benefits of a Course Recommendation Engine
Why consider building a course recommendation engine in the first place? Well, for the personal learning challenge, of course!
Along with personal growth and development, A Cloud Guru recently launched a new combined platform with Linux Academy offering 250% more courses than before, 470+ quizzes and exams, and 1,500+ Hands-On Labs! How are you going to navigate all of that content? A machine learning-enabled recommendation engine could be just the tool to help you.
Our overall hope is that a machine learning-enabled recommendation engine will increase the viewership of courses and student engagement since we’re recommending courses specifically tailored to your interests and mastery.
The Big Picture
If you’re not familiar with machine learning, it allows a computer to study data and find trends and patterns that may otherwise be hidden.
Tip: To learn more about machine learning, watch this episode of Kesha’s Korner to come up to speed.
The key to making this solution work was our chosen machine learning technique that allowed us to group (or cluster) like students. There are several machine learning techniques (supervised, unsupervised, reinforcement, transfer, etc.) and learning algorithms to choose from. We chose our technique based on the accessible data points and the learning algorithm based on the results we expected (i.e. groupings of students). We landed on unsupervised learning for our technique because our data was not labeled and K-means as the learning algorithm because we needed to group our students into a specific number (or k number) of clusters (or groups) based on similarities.
Our tool of choice to build the machine learning model was Amazon SageMaker. We chose Amazon SageMaker because it provides compute instances running Jupyter Notebooks. Luckily for us, SageMaker managed creating the instance and all of the related resources needed. We used our notebook instance to prepare our data and run the clustering algorithm.
So, how does it work? The K-means learning algorithm is fed the progress a student has made on courses across chosen categories such as AWS, Python, containers, DevOps, machine learning, and more.
Then, the learning algorithm clusters (or groups) students into learning groups based on the similarities it identifies. Imagine it like this: each student goes through a magic funnel for evaluation. The magic funnel determines which learning group (or cluster) a student should be placed in.
After the machine learning algorithm evaluates each student, we are left with distinct groups.
Going forward, we call these clusters learning groups. Course recommendations will be made based on the student’s assigned learning group.
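To make the "magic funnel" a little less magic, here is a minimal sketch of the two steps K-means repeats: assign each student to the nearest cluster center (centroid), then move each centroid to the average of its members. The function name `kmeans_sketch`, the toy progress values, and the choice of two clusters are illustrative assumptions; this is not the code from our POC, which relies on the Scikit-learn implementation.

```python
import numpy as np

def kmeans_sketch(points, k, n_iters=10, seed=0):
    """Toy K-means loop for illustration only (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct rows picked at random as the initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iters):
        # Step 1: assign every point to its nearest centroid
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 2: move each centroid to the mean of the points assigned to it
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = points[labels == j].mean(axis=0)
    return labels, centroids

# Hypothetical average progress per student in two categories (AWS, Python)
progress = np.array([
    [0.90, 0.10],
    [0.80, 0.20],
    [0.10, 0.90],
    [0.20, 0.80],
])
labels, centroids = kmeans_sketch(progress, k=2)
print(labels)
```

In this toy data, the two AWS-heavy students end up in one group and the two Python-heavy students in the other. The label numbers themselves are arbitrary; only the grouping matters.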
Steps to Produce the Learning Groups
Let’s review the steps we went through to produce the learning groups. First, we obtained the data, then we prepared it, and lastly, we fed it to the machine learning algorithm to produce the groups. Let’s talk about each step in detail.
Our Data
The most important part of a machine learning project is having reliable data. For this POC, we used employee data. One cool benefit of working for A Cloud Guru is that we have free access to the content we create! We are always learning new things, so we had a lot of data to feed to the machine learning algorithm! We started with two data files: courses.csv and students.csv.
Note: I do want to highlight that the data files provided in this example have been anonymized and slightly simplified for this blog post.
A list of courses
Courses.csv contains a list of courses and their associated categories.
A list of students
Students.csv contains a list of the courses our employees have watched along with the associated progress.
Data Preparation
The data points provide a great starting point. Although we have this data, the files as they are aren’t good enough for a machine to learn from. We need to process and prepare the data so that the learning algorithm can easily find trends and patterns. SageMaker is flexible enough to allow us to do this, so we launched a SageMaker Jupyter hosted notebook instance and got to work writing Python code.
To process the data, we imported the necessary libraries, including Pandas under its usual alias pd. Then, we imported courses.csv into a Pandas DataFrame called courses_df and displayed the first few rows.
Tip: Pandas is a Python data analysis library that provides a tabular data structure for manipulating data via a DataFrame.
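# Import the libraries used throughout the notebook
import pandas as pd
import matplotlib.pyplot as plt

# Import the course data
courses_df = pd.read_csv('data/courses.csv')
# Review the first few records to verify the import
courses_df.head()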
Then, we checked the data for null (or empty) records and removed them. We don’t want the learning algorithm to learn from records with missing data because that could cause the algorithm to find incorrect patterns.
# View records that are NaN
courses_df[courses_df.categories.isnull()]
# Remove records that have no category
courses_df.dropna(subset=['categories'], inplace=True)
Next, we imported students.csv into a second Pandas DataFrame called students_df and verified the data by printing the first few rows.
students_df = pd.read_csv('data/students.csv')
students_df.head()
Data Inspection & Visualization
When dealing with data, it’s important to have domain knowledge because it helps you easily spot and remedy issues. There are many ways to explore and get to know your data. We decided to use Matplotlib — a Python-based data visualization library — and additional techniques to better understand our data, specifically looking for patterns and outliers.
Data Distributions
We used histograms to better understand the distribution of Course Ids and Student Ids across the dataset.
plt.hist(students_df.courseId)
plt.hist(students_df.studentId)
Image 18: Student Id histogram
Student Level Analysis
We were curious about the popularity of courses, so we printed the students that watched course #73.
students_df[students_df.courseId == 73]
Image 19: Course 73 results
Next, we wanted to understand which courses each student watched. We built a cross-tabulation table, which is used to show the frequency with which certain groups of data appear.
pd.crosstab(students_df.studentId,students_df.courseId)
Image 20: Cross-tabulation of students and courses
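If cross-tabulation is new to you, a tiny example may help. The sketch below uses a made-up four-row dataset (the IDs are hypothetical, not taken from our files) to show that each row of the result is a student, each column is a course, and each cell counts how many matching records exist.

```python
import pandas as pd

# Hypothetical watch records: one row per (student, course) pair
toy_students = pd.DataFrame({
    "studentId": [1, 1, 2, 3],
    "courseId": [10, 20, 10, 30],
})

# Rows are students, columns are courses, cells count matching records
watch_table = pd.crosstab(toy_students.studentId, toy_students.courseId)
print(watch_table)
```

A 0 in a cell means that student has no record for that course, which makes gaps in viewing history easy to spot.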
Next, we wanted to understand which students watched the most content. We counted the number of times a studentId appears in the dataset.
students_df.groupby('studentId').size()
Image 21: Count by studentId
Next, we wanted to understand activity on a student level. For example, we selected student #61 to understand their watch history.
students_df.loc[students_df['studentId'] == 61]
Image 22: Student activity
Course Level Analysis
Next, we wanted to find out which courses were the most popular. We counted the number of times a courseId appeared in the dataset.
students_df.groupby('courseId').size()
Image 23: Course watch count by id
Lastly, we wanted to understand which students watched a particular course. In this case, we selected course #5.
students_df.loc[students_df['courseId'] == 5]
Image 24: Details for Course #5
Data Transformation
Now that we have a good understanding of our data, we can move to the next step which is to transform the data for machine consumption. Unfortunately, a machine learning algorithm cannot find trends and patterns in the data files as they are currently presented. The two files are disjointed and need to be combined into one file that shows the category watch time across all students and all categories. In order to do that, a little Python code is required.
We created a function to return a distinct list of categories.
def get_list_of_categories(courses):
    category_list = []
    for category in courses.categories.str.split('|'):
        for name in category:
            # Strip whitespace before the membership check so ' AWS' and 'AWS' are not both added
            name = name.strip()
            if name not in category_list:
                category_list.append(name)
    return category_list
We also created a function to create a unique column name for each category.
def get_column_name_list(category_list):
    column_name = []
    for category in category_list:
        column_name.append('avg_' + category.strip() + '_watch')
    return column_name
We then used the two functions we created with additional logic to calculate the watch time across all students and categories.
# Category watch time across ALL students and ALL categories
def get_all_category_watch_time(students, courses):
    category_progress = pd.DataFrame(columns = ['studentId'])
    category_list = get_list_of_categories(courses)
    column_names = get_column_name_list(category_list)
    # Add studentId to the list of columns
    column_names.insert(0, 'studentId')
    for category in category_list:
        course_categories = courses[courses['categories'].str.contains(category)]
        # Determine the average watch time for the given category; retain the studentId
        watched = students[students['courseId'].isin(course_categories['courseId'])]
        avg_watch_time_per_user = watched.groupby('studentId')['progress'].mean().round(2).reset_index()
        # Rename 'progress' to the category so repeated merges do not collide on the same column name
        avg_watch_time_per_user = avg_watch_time_per_user.rename(columns={'progress': category})
        # Merge the progress for the given category with the prior categories
        category_progress = category_progress.merge(avg_watch_time_per_user, on='studentId', how='outer')
    category_progress.columns = column_names
    return category_progress

# Calculate the average watch time per category for each student
category_watch_time_df = get_all_category_watch_time(students_df, courses_df)
category_watch_time_df
Image 25: Category watch time across all students and categories
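For readers who prefer a more compact Pandas idiom, the same kind of student-by-category table can be sketched with `explode` and `pivot_table`. This is an alternative under stated assumptions, not the code we ran: the toy DataFrames and their values are hypothetical, progress is assumed to be numeric, and categories are matched exactly rather than with `str.contains`.

```python
import pandas as pd

# Hypothetical stand-ins for courses.csv and students.csv
toy_courses = pd.DataFrame({
    "courseId": [1, 2, 3],
    "categories": ["AWS|Cloud", "Python", "AWS"],
})
toy_students = pd.DataFrame({
    "studentId": [61, 61, 62],
    "courseId": [1, 2, 3],
    "progress": [0.5, 1.0, 0.8],
})

# One row per (course, category) pair
course_categories = toy_courses.assign(
    category=toy_courses["categories"].str.split("|")
).explode("category")
course_categories["category"] = course_categories["category"].str.strip()

# Attach categories to each watch record, then average progress per student and category
watch_by_category = toy_students.merge(
    course_categories[["courseId", "category"]], on="courseId"
)
category_watch = watch_by_category.pivot_table(
    index="studentId", columns="category", values="progress", aggfunc="mean"
).round(2)
print(category_watch)
```

Categories a student never touched come out as NaN here as well, so the same clean-up step applies before clustering.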
We then analyzed the watch time for our favorite student, #61.
category_watch_time_df.loc[category_watch_time_df['studentId'] == 61]
Image 25: Category watch time for student #61
When reviewing the data, we noticed that categories a student hadn’t watched had a value of NaN; therefore, we converted NaN to a watch time of 0. The clustering algorithm cannot work with NaN, but it can work with 0.
# Replace NaN with 0
category_watch_time_df = category_watch_time_df.fillna(0)
We then removed the studentId because it is of no benefit during the clustering (or grouping) process.
# Remove the student id from the DataFrame
category_watch_time_list = category_watch_time_df.drop(['studentId'], axis=1)
We are left with the final training data. A machine can learn from this!
Image 26: Excerpt of final transformed data
Now that we have the data in a format that a machine learning algorithm can learn from, we start the training process using the K-means learning algorithm. We chose Scikit-learn because it’s very easy for beginners to learn and its K-means implementation is simple and efficient for our dataset.
We converted our data to a NumPy array, which the Scikit-learn K-means implementation accepts as input.
# Turn our dataset into a NumPy array
category_watch_time_list = category_watch_time_list.values
print(category_watch_time_list)
We then imported K-Means.
# Import KMeans
from sklearn.cluster import KMeans
Next, we created an instance of K-Means and set the algorithm to find 20 clusters (or groups). When using K-means, the results may change with each run. To counteract varying results, we set the random_state to 0, which makes our results reproducible.
# Create an instance of KMeans to find 20 clusters
km = KMeans(n_clusters=20, random_state=0)
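To see what fixing random_state buys you, the sketch below clusters the same data twice with the same seed and checks that both runs assign identical labels. The random 30-by-4 matrix stands in for our watch-time data, and the values of n_clusters and n_init are assumptions chosen to keep the example small.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical watch-time matrix: 30 students x 4 categories
rng = np.random.default_rng(42)
toy_watch_time = rng.random((30, 4))

# Two separate runs with the same random_state
first_run = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(toy_watch_time)
second_run = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(toy_watch_time)

# Compare the label arrays from both runs
print(np.array_equal(first_run, second_run))
```

Without a fixed random_state, the starting centroids are chosen differently on each run, so the cluster numbering (and sometimes the grouping itself) can shift between runs.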
We then used the instance to cluster students and stored the results in predictions. Predictions will contain the assigned cluster for each student.
# Use fit_predict to cluster the dataset
# Returns a cluster prediction for each student, i.e. the cluster labels
predictions = km.fit_predict(category_watch_time_list)
Lastly, we printed the assigned clusters.
print(predictions)
Image 28: Assigned clusters
Map Student ID to Cluster Number
Now that we have the clusters, we need to map those clusters back to the studentId. First, we converted the NumPy array of predictions to a DataFrame.
cluster_df = pd.DataFrame(data=predictions)
cluster_df.columns = ['assigned_cluster']
cluster_df
Then, we merged the DataFrames to determine the assigned cluster. We also dropped the unnecessary columns.
student_cluster_df = pd.DataFrame(columns = ['studentId', 'assigned_cluster'])
student_cluster_df = pd.concat([cluster_df, category_watch_time_df], axis=1)
student_cluster_df = student_cluster_df[student_cluster_df.columns[student_cluster_df.columns.isin(['studentId', 'assigned_cluster'])]]
student_cluster_df
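The mapping works because K-means returns one label per input row, in the same order as the rows it was given. As a minimal sketch of that idea, the labels can also be attached to the student IDs directly by position. The toy DataFrame and label array below are hypothetical stand-ins for category_watch_time_df and predictions.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for category_watch_time_df and predictions
toy_watch_time_df = pd.DataFrame({
    "studentId": [61, 62, 63],
    "avg_AWS_watch": [0.9, 0.1, 0.8],
})
toy_predictions = np.array([1, 0, 1])

# Labels come back in row order, so they can be attached positionally
toy_student_cluster_df = toy_watch_time_df[["studentId"]].assign(
    assigned_cluster=toy_predictions
)
print(toy_student_cluster_df)
```

Either approach depends on the rows not being reordered between clustering and mapping.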
Cluster Analysis
Next, we wanted to understand more about each cluster that the machine learning algorithm defined. We used a variety of methods to determine this.
First, we counted the number of times a cluster appeared in the dataset.
# Count the number of times a cluster appears in the dataset
student_cluster_df.groupby('assigned_cluster').size()
Plot Cluster
Then we plotted the clusters to see a visualization.
# Plot the first two category columns, colored by assigned cluster
plt.scatter(category_watch_time_list[:,0], category_watch_time_list[:,1], c=km.labels_, cmap='rainbow')
Image 31: Cluster plot
Explore Cluster #6
Then we picked cluster #6 to further explore and study the student characteristics along known dimensions. We were curious about the commonalities between the students in cluster 6.
# What do the students in cluster 6 have in common?
# Show the students assigned to cluster #6
# (the filter uses label 5; K-means labels start at 0, so this is presumably the sixth cluster)
cluster6_df = student_cluster_df.loc[student_cluster_df['assigned_cluster'] == 5]
# Get only the student progress records that appear in cluster 6
# (.copy() lets us safely modify the progress column later)
cluster6_students_df = students_df[students_df['studentId'].isin(cluster6_df['studentId'])].copy()
# Print the students
cluster6_students_df
Then we pulled out the courses watched and mastered by students in that cluster. We utilized a watch percentage above 90% to make this determination.
# Convert the progress percentage string (for example '95%') to a number between 0 and 1
cluster6_students_df['progress'] = cluster6_students_df['progress'].str.rstrip('%').astype('float') / 100.0
# Limit to only courses that are above a 90% watch rate
cluster6_students_df = cluster6_students_df.loc[cluster6_students_df['progress'] > .90]
cluster6_students_df
Image 32: Courses above a 90% watch rate
Explore categories across clusters
Now we have a list of courses watched and mastered by students in cluster #6. Next, we explored the categories across the cluster by obtaining the course categories watched by students.
# Get the course categories for courses watched (by courseId) by students in the cluster
cluster6_courses_watched_df = courses_df[courses_df['courseId'].isin(cluster6_students_df['courseId'])]
category_list = get_list_of_categories(cluster6_courses_watched_df)
print("The amount of categories for Cluster 6: ", len(category_list))
print("The categories in Cluster 6", category_list)
Students in cluster #6 love AWS, Certification, Cloud, Development, Serverless, Programming, Database, and AI courses.
Now that we understand which cluster a user is assigned to and the course categories associated with that cluster, we can recommend courses to each student based on their assigned cluster when they access the platform. For example, if the logged in student is a member of cluster #6, then we recommend courses across AWS, Certification, Cloud, Development, Serverless, Programming, Database, and AI to that student. This recommendation accounts for what they’ve watched and mastered along with what their cluster members have watched and mastered.
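As a rough sketch of how that lookup could work at the course level, the function below suggests courses that a student's cluster peers have mastered but the student has not. The function name `recommend_courses`, the toy DataFrames, and the idea of recommending individual courses rather than categories are all assumptions for illustration; this is not part of the POC code. It reuses the 90% mastery threshold from the cluster analysis.

```python
import pandas as pd

MASTERY_THRESHOLD = 0.90  # same 90% cut-off used in the cluster analysis

def recommend_courses(student_id, student_clusters, progress):
    """Return course IDs mastered by cluster peers that the student has not mastered."""
    # Look up the cluster the student belongs to
    cluster = student_clusters.loc[
        student_clusters["studentId"] == student_id, "assigned_cluster"
    ].iloc[0]
    # Everyone else in the same cluster
    peers = student_clusters.loc[
        (student_clusters["assigned_cluster"] == cluster)
        & (student_clusters["studentId"] != student_id),
        "studentId",
    ]
    mastered = progress[progress["progress"] > MASTERY_THRESHOLD]
    peer_courses = set(mastered.loc[mastered["studentId"].isin(peers), "courseId"])
    own_courses = set(mastered.loc[mastered["studentId"] == student_id, "courseId"])
    return sorted(int(course) for course in peer_courses - own_courses)

# Hypothetical cluster assignments and numeric progress records
toy_clusters = pd.DataFrame({
    "studentId": [61, 62, 63],
    "assigned_cluster": [5, 5, 2],
})
toy_progress = pd.DataFrame({
    "studentId": [61, 62, 62, 63],
    "courseId": [5, 5, 73, 8],
    "progress": [0.95, 1.00, 0.92, 0.99],
})
print(recommend_courses(61, toy_clusters, toy_progress))  # prints [73]
```

Student 61 shares a cluster with student 62, who has also mastered course 73, so that course is suggested. A student who is alone in a cluster gets an empty list, which a real engine would need to handle with a fallback.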
Code Samples
The full code and data files can be found on GitHub at content-recommendation-engine.
Future Enhancements
There are several future enhancements that can be made to the POC course recommendation engine:
- The engine can be expanded to recommend not just courses, but hands-on labs, projects, blog posts, web series episodes, and more.
- Currently, clustering is informed solely by student progress; however, how a student has rated a course can also be used to make an informed decision.
- Clustering can also be informed by the start watch date and the last watch date to determine if someone really enjoyed a course. For example, did they binge-watch a course over the weekend or did it take them several months to finish? This information can be useful for clustering.
- Clusters can be used to form study groups with your peers across the industry. Imagine being able to meet up with your cluster members at a future conference. How cool is that?
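To give a flavor of the watch-date idea from the list above, here is a minimal sketch that turns a start date and a last-watch date into a "days to finish" feature. The column names `startDate` and `lastWatchDate` and the sample dates are hypothetical; our POC data does not include them.

```python
import pandas as pd

# Hypothetical watch records with start and last-watch dates (not in the POC data)
toy_activity = pd.DataFrame({
    "studentId": [61, 62],
    "courseId": [5, 5],
    "startDate": ["2020-06-06", "2020-01-15"],
    "lastWatchDate": ["2020-06-07", "2020-05-20"],
})

start = pd.to_datetime(toy_activity["startDate"])
last = pd.to_datetime(toy_activity["lastWatchDate"])

# Days between first and last watch; a small value suggests binge-watching
toy_activity["days_to_finish"] = (last - start).dt.days
print(toy_activity[["studentId", "days_to_finish"]])
```

A column like this could sit alongside the per-category progress columns as one more input to the clustering step.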
Next Steps: Learn More About Machine Learning
Are you ready to get started with machine learning? Explore the following courses in the A Cloud Guru library to learn more about machine learning.
Introduction to Machine Learning
Feeling over your head on your first machine learning project? Struggling with all the math jargon? ACG's Intro to Machine Learning will teach you the machine learning vocab and skills you need to get up to speed.
AWS Certified Machine Learning - Specialty 2020
Ace the AWS Certified Machine Learning Specialty exam and pick up best practices for data engineering, data analysis, machine learning modeling, model evaluation, and deployment in AWS. No PhD in Mathematics needed!
Machine Learning for Absolute Beginners
No tech know-how required. This ACG course is currently free and will give you the foundational understanding you need to understand what machine learning is (and isn't) — and what it can do for your business or your career.