Wrangle Data with Python in Azure Machine Learning

In this lab, you will use a notebook and Python code to retrieve a small sample of data and load it into a pandas DataFrame. You will then wrangle the data (that is, cleanse and transform it interactively) until you are satisfied that it is good enough for use in your machine learning model. Students with prior experience with Python in general, and the pandas library specifically, will have the best opportunity to complete this lab with minimal assistance. However, any student motivated to read code documentation can combine that research with help from the lab guide and the solution video.

Path Info

Level: Intermediate
Duration: 45m
Published: Mar 05, 2024

Table of Contents

  1. Challenge

    Housekeeping

    You should already be logged in to the Azure portal. If you have problems, be sure to use an InPrivate or Incognito window in your browser, and be sure to use the login credentials provided with the lab.

    1. After you have logged in, from the resource group already deployed for you, select the Azure Machine Learning workspace, also already deployed for you.

    2. From the workspace, launch Azure Machine Learning studio; a separate tab will open.

    3. Once in the studio, choose Notebooks, and add a new file to open a new notebook. Name the file anything you like, and leave the file type defaulted.

    4. At the top, next to Compute, confirm that the compute instance deployed for you is running. Start it, if necessary, but do not create a new instance or try to use the serverless Spark option. If you do, the lab will fail.

    5. When prompted, authenticate/validate your connection to the compute instance.

    6. Add a line of sample code to the first notebook cell:

      print("Hello World")
      
    7. Confirm the kernel is running by clicking the Run button to the left of the cell. The run should complete within a few seconds.

  2. Challenge

    Wrangle Data in Python Code

    Suggestions Before You Start

    • Create a new Code cell in your notebook for each coding task in order to wrangle iteratively and interactively.

    • Name your DataFrame my_dataframe to more easily check your work and understand the code hints.

    • Enter the code below at the end of each cell to display up to the first 1,000 rows, which should cover every row in the small sample dataset. This lets you study the data and any changes you've made to it.

      my_dataframe.head(1000)
      
    • When you call a method that modifies one or more rows in the data, pass in the inplace=True parameter so that the change is applied directly to the DataFrame in memory instead of being returned as a new copy. For example:

      my_dataframe.someMethod(aParameter, anotherParameter, inplace=True)
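    The difference that inplace=True makes can be seen in a minimal sketch like the one below. The sample data and the column name status are assumptions made up for illustration; they are not taken from the lab's dataset.

```python
import pandas as pd

# Hypothetical sample data; the column name "status" is an assumption.
my_dataframe = pd.DataFrame({"status": ["Active", None, "Closed"]})

# Without inplace=True, fillna returns a new DataFrame and leaves
# my_dataframe untouched.
filled_copy = my_dataframe.fillna("Unknown")
missing_before = int(my_dataframe["status"].isna().sum())

# With inplace=True, the same call modifies my_dataframe directly.
my_dataframe.fillna("Unknown", inplace=True)
missing_after = int(my_dataframe["status"].isna().sum())

print(missing_before, missing_after)
```

    Reassigning the result, as in my_dataframe = my_dataframe.fillna("Unknown"), should have the same effect if you prefer not to use inplace=True.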
      

    Coding Tasks

    1. Import the pandas library and read the contents of a CSV file into a DataFrame. The CSV file is located on GitHub, accessible at a URL provided in the Additional Resources section of this lab. View the DataFrame contents.
    2. For some of the rows in the dataset, two status columns contain None or NaN instead of values, indicating that those columns are empty (NULL) in those rows. Replace the empty values with the word "Unknown".
    3. Assume that you have cleaned up missing values as much as you can. Delete any rows in the dataset that still have any empty columns.
    4. You should note three exact duplicate rows in the data. Keep the first occurrence of the duplicate set, and delete the other two.
    5. BONUS TASK: The de-duping process may leave the rows out of their original order, which makes it harder to spot the remaining row from the set of duplicates. Execute a method that restores the order based on the index (the sequence number on the far left of each row).
    6. Save the DataFrame to a new CSV file located in the same folder as the notebook you created at the beginning of the lab.
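    As a rough illustration of tasks 2 through 5, the sketch below runs the same kind of cleanup on a small made-up DataFrame. Every column name and value here is an assumption for demonstration only; the lab's real dataset has its own columns, so treat this as a pattern rather than the solution.

```python
import pandas as pd

# Made-up rows standing in for the lab's CSV; all column names are assumptions.
my_dataframe = pd.DataFrame(
    {
        "name": ["Ana", "Ben", "Ben", "Ben", "Cy", None],
        "order_status": ["Shipped", None, None, None, "Open", "Open"],
        "payment_status": ["Paid", "Paid", "Paid", "Paid", None, "Paid"],
    }
)

# Task 2: replace missing values in the two status columns only.
status_columns = ["order_status", "payment_status"]
my_dataframe.fillna({col: "Unknown" for col in status_columns}, inplace=True)

# Task 3: drop rows that still have a missing value in any column.
my_dataframe.dropna(inplace=True)

# Task 4: keep the first row of each set of exact duplicates.
my_dataframe.drop_duplicates(keep="first", inplace=True)

# Task 5 (bonus): order the rows by their index labels.
my_dataframe.sort_index(inplace=True)

print(my_dataframe)
```

    Passing a dictionary to fillna limits the replacement to the named columns, so a missing value elsewhere (the last row's name in this sketch) is left for dropna to handle.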

    Coding Hints

    Spoiler alert: These hints might provide you with more help than you need. If they provide you with less help than you need, then it is probably time to visit the lab guide or solution video. Coding hint numbers refer to the coding task numbers.

    1. The import statement is import pandas as pd, and the pandas function is read_csv (called as pd.read_csv).
    2. The DataFrame method is fillna.
    3. The DataFrame method is dropna.
    4. The DataFrame method is drop_duplicates.
    5. The bonus re-sorting DataFrame method is sort_index.
    6. The method to persist data is to_csv.
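    Hints 1 and 6 can be tried together in a small round trip like the sketch below. Because the lab's CSV URL is listed only in its Additional Resources section, this example writes a stand-in file to a temporary folder instead; the file names, column names, and data are all assumptions.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in CSV text; the lab's real file is at the URL given in its resources.
sample_csv = "id,status\n1,Active\n2,\n3,Closed\n"

with tempfile.TemporaryDirectory() as folder:
    source_path = Path(folder) / "sample.csv"    # hypothetical file name
    output_path = Path(folder) / "wrangled.csv"  # hypothetical file name
    source_path.write_text(sample_csv)

    # read_csv accepts a local path or a URL string.
    my_dataframe = pd.read_csv(source_path)
    my_dataframe.fillna({"status": "Unknown"}, inplace=True)

    # index=False leaves the row index out of the saved file.
    my_dataframe.to_csv(output_path, index=False)
    saved_text = output_path.read_text()

print(saved_text)
```

    In the lab notebook, passing just a file name to to_csv writes the file relative to the kernel's working directory, which is typically the notebook's folder.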

The Cloud Content team comprises subject matter experts hyper focused on services offered by the leading cloud vendors (AWS, GCP, and Azure), as well as cloud-related technologies such as Linux and DevOps. The team is thrilled to share their knowledge to help you build modern tech solutions from the ground up, secure and optimize your environments, and so much more!
