
Responsible AI: How to manage data drift in Azure ML

Every ML model will experience data drift at some point. Here's how you can use Azure's "data drift detection for datasets" feature to identify and correct it.

Oct 11, 2023 • 6 Minute Read


Are you using Azure for machine learning and want to make sure your models are free from data drift and responsibly executed? In this article, we cover the fundamentals of data drift and how you can deal with it using Azure's "data drift detection for datasets" feature. We also provide a guide on how to use this feature, along with example code snippets.

What is data drift, and why is it bad?

Data drift is when a divide appears between the data used to train a machine learning model and the real-world data it actually encounters. This misalignment can lead to incorrect predictions and a decline in the model's overall performance.

Think of it like someone who’s taken a cybersecurity course. When they complete it, they’re up to date on the latest threats, prepared for anything! But as the landscape shifts and hackers come up with new moves, their cutting-edge knowledge becomes outdated — there’s a “drift” between what they’ve been taught to expect and what they’re actually dealing with.

The result? They're less sure what to expect, and respond more slowly when thrown a curveball. Unlike a cybersecurity professional, however, an ML model won't flag this as an issue when it happens. Instead, you've got to monitor your models for drift, and address it early when you detect it.

In both cases, the solution is similar: periodic “refresher courses”, or retraining, are essential to stay current and effective. Failing to update could result in inefficiency or errors, whether you're a human expert or a computational model.
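To make "drift" concrete, here's a small, framework-agnostic sketch (not Azure's algorithm) that scores the gap between training data and live data with the Population Stability Index (PSI), one common drift statistic; values above roughly 0.2 are often treated as significant drift. The feature and numbers are invented for illustration:

```python
import math
import random

def psi(baseline, current, bins=10):
    """Population Stability Index between two numeric samples (0 = identical)."""
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins

    def frac(sample, i):
        # fraction of the sample falling in bin i, floored to avoid log(0)
        left = lo + i * width
        if i == bins - 1:
            n = sum(left <= x <= hi for x in sample)   # last bin includes the max
        else:
            n = sum(left <= x < left + width for x in sample)
        return max(n / len(sample), 1e-6)

    return sum((frac(current, i) - frac(baseline, i))
               * math.log(frac(current, i) / frac(baseline, i))
               for i in range(bins))

random.seed(0)
train_ages = [random.gauss(40, 8) for _ in range(5000)]  # what the model trained on
live_ages = [random.gauss(47, 8) for _ in range(5000)]   # the population has aged

print(psi(train_ages, train_ages))       # 0.0 -- identical data, no drift
print(psi(train_ages, live_ages) > 0.2)  # True -- the distribution has shifted
```

Azure's monitor computes its own drift magnitude per feature, but the idea is the same: quantify how far incoming data has wandered from the baseline.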

Why does data drift occur?

Data drift can be caused by many factors, including changes in upstream processes, natural shifts in data distribution — even in clean, valid datasets — or alterations in how input data is collected and prepared.

How is data drift related to concept drift and model drift?

When the relationship between data features and the target variable changes, this is sometimes termed "concept drift." For instance, a pandemic can abruptly alter consumer behavior and medical protocols. Concept drift can also be cyclical, linked to seasons or elections, or gradual, as seen with aging populations or changing social norms.

When the combined effects of data drift and concept drift blow your results off course, that's referred to as "model drift," which can break downstream processes and corrupt data. However, there's a silver lining in this windy world: ironically, data scientists who proactively monitor drift often discover innovative ways to make use of this shifted data.

How is data drift connected to responsible AI?

All major cloud providers publish guidelines for responsible AI. Microsoft has boiled it down to six responsible AI principles, including one titled “reliability and safety.” Model drift can ultimately result in falling short on all six principles, but it directly affects the goal to produce safe and reliable output.

How Azure's "data drift detection for datasets" feature helps

Data drift occurrence isn't a question of "if," it's "when." To deal with this inevitability in your ML models built in Microsoft Azure, you should become familiar with the "data drift detection for datasets" feature found in Azure Machine Learning studio.

(Not exactly a catchy name — I favor something like “Drift Sniffer,” with a playful Saint Bernard peeking out from behind a snowy mound  — but I digress. The current name may be less cute, but it makes the purpose reasonably clear.)

Using this feature, you can create dataset monitors that detect differences between training datasets and incoming datasets. You can also track statistical property changes over time, and set up alerts to proactively recognize drift issues. When you determine that the data has drifted too much, you can create a new version of the baseline dataset.

How to use Azure's "data drift detection for datasets" feature

Before we start, here are four things you should know:

  1. You can use the Python SDK or Azure Machine Learning studio to view data drift metrics.

  2. Turn to Azure Application Insights to leverage other metrics and insights. This resource is deployed with every Azure Machine Learning workspace.

  3. Data drift detection is currently in public preview, and as with all preview features, it is offered without an SLA and is not recommended for production workloads.

  4. It can be difficult to set up a meaningful proof-of-concept for data drift monitoring if you do not already have a working machine learning model with non-trivial datasets in production, or at least in the latter stages of training. So consider the code snippets that follow more of a template for planning than a tutorial for teaching.

Also, as you plan for general monitoring and maintenance of your Azure Machine Learning solutions, be sure to consider how you will monitor and respond to data drift.

Starting requirements

The following instructional snippets of code from the Python SDK for Azure ML assume that you already have: 

  • A workspace

  • A registered tabular dataset, such as a CSV file (probably the one you used for training), to use as a baseline

  • A pipeline that is set up to ingest data through another registered dataset, which is your target dataset

  • A target dataset that has the same features as your baseline, plus a timestamp column, and contains at least nine weeks of data

  • An existing compute cluster to run the data drift monitor
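For reference, here's a rough sketch of how the objects the snippets below assume (ws, baseline_data_set, target_data_set, my_existing_cluster) might be obtained with the Python SDK. The dataset and cluster names are placeholders for this example, and your target dataset must have been registered with its timestamp column assigned (for instance, via with_timestamp_columns):

```python
from azureml.core import ComputeTarget, Dataset, Workspace

# connect to the workspace (reads the config.json downloaded from the portal)
ws = Workspace.from_config()

# the registered tabular dataset used for training, serving as the baseline
baseline_data_set = Dataset.get_by_name(ws, name='claims-training-data')

# the registered dataset your pipeline ingests new data into; it must share
# the baseline's features and carry a timestamp column
target_data_set = Dataset.get_by_name(ws, name='claims-incoming-data')

# an existing compute cluster to run the data drift monitor
my_existing_cluster = ComputeTarget(workspace=ws, name='my-cpu-cluster')
```

This setup sketch only runs against a real Azure Machine Learning workspace, so treat it as a checklist rather than copy-paste code.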

Got all that? Good! Let’s go through the steps.

1. Define a data drift monitor

from azureml.datadrift import DataDriftDetector

# set up a list of features to monitor
mon_features = ['ClaimDollars', 'Age', 'Accidents']

# define the data drift detector
my_monitor = DataDriftDetector.create_from_datasets(
    ws, 'my-data-drift-monitor',
    baseline_data_set, target_data_set,
    compute_target=my_existing_cluster,  # existing cluster to run the monitor
    frequency='Week',                    # how often to check for drift
    feature_list=mon_features,           # omit to monitor all features
    drift_threshold=.2,                  # magnitude above which to alert
    latency=24)                          # hours to wait for late-arriving data
my_monitor

2. Backfill data to test your monitor

import datetime as dt

from azureml.widgets import RunDetails

# manually feed the monitor the last nine weeks of data
bf_data = my_monitor.backfill(dt.datetime.now() - dt.timedelta(weeks=9),
                              dt.datetime.now())

RunDetails(bf_data).show()
bf_data.wait_for_completion()

3. View metrics regarding drift

# Access metrics in code, or visit "Dataset monitors" under "Data" in Azure ML studio
snow_drifts = bf_data.get_metrics()
for metric in snow_drifts:
    print(metric, snow_drifts[metric])

In my opinion, that’s a small amount of code to deliver a lot of value!
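As a likely next step once the backfill results look reasonable, you can let the monitor run on its own. A brief sketch, under the same assumptions as the snippets above:

```python
# run the monitor automatically at the frequency set at creation ('Week')
my_monitor.enable_schedule()

# pause scheduled runs later if you no longer need them
# my_monitor.disable_schedule()
```

With the schedule enabled, drift results accumulate without manual backfills, and alerts can fire when the drift threshold is exceeded.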

Next steps

As I said earlier, it can be difficult to set up a proof of concept for data drift monitoring if you don't first have a working ML model with non-trivial datasets. If you'd like to get started on setting one up, or want to make sure you've ticked all the boxes, check out the course I co-authored with Brian Roehm: "DP-100: Designing and Implementing a Data Science Solution on Azure." As the name suggests, it's also a good course if you're studying for the Microsoft Azure Data Scientist Associate certification.

If you want to learn more about how to measure your machine learning models against responsible AI principles, read up on Azure’s Responsible AI Dashboard. Pluralsight also offers a range of beginner, intermediate, and expert AI and ML courses — you can sign up for a 10-day free trial with no commitments.

Amy Coughlin

Amy Coughlin is a Pluralsight Author and Senior Azure Training Architect with over 30 years of experience in the tech industry, mainly focused on Microsoft stack services and databases. She's living the dream of combining her love of technology with her passion for teaching others.
