  • Lab
  • A Cloud Guru

Ingesting Data Using AWS Glue

You are a data engineer tasked with migrating some new CSV files into S3. Once the files are there, you need to add their schema to the AWS Glue Data Catalog using a Glue crawler. The Data Science team would like to train some new models on a combined table of this information, using only a few select columns. To decrease cost and improve processing time, you need to create a Glue job that combines the CSV files into one table containing only the needed columns. Then, you need to run the job and verify its output in S3. After that, you can inform the Data Science team that the data is ready for training.
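
The column-trimming idea in this scenario can be tried locally before touching AWS. The following is a minimal sketch in plain Python, not the Glue job itself: it merges rows from several CSV sources and keeps only the three columns named in the lab outline. The sample rows and the extra column names (`tag_id`, `fine_amount`) are invented for illustration and are not taken from the real dataset.

```python
import csv
import io

# Columns the lab's transformation step keeps (named in the lab outline).
KEEP_COLUMNS = ["date_of_infraction", "infraction_code", "infraction_description"]


def combine_csv(sources, keep_columns=KEEP_COLUMNS):
    """Merge rows from several CSV file objects, keeping only keep_columns."""
    combined = []
    for source in sources:
        for row in csv.DictReader(source):
            combined.append({name: row[name] for name in keep_columns})
    return combined


def to_csv_text(rows, keep_columns=KEEP_COLUMNS):
    """Serialize the trimmed rows back to CSV text with a header line."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=keep_columns, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Invented sample inputs; tag_id and fine_amount are hypothetical extra columns.
file_a = io.StringIO(
    "tag_id,date_of_infraction,infraction_code,infraction_description,fine_amount\n"
    "A1,20220101,1,EXAMPLE DESCRIPTION A,10\n"
)
file_b = io.StringIO(
    "tag_id,date_of_infraction,infraction_code,infraction_description,fine_amount\n"
    "B1,20220102,2,EXAMPLE DESCRIPTION B,20\n"
)

rows = combine_csv([file_a, file_b])
print(to_csv_text(rows))
```

Dropping unneeded columns before training means less data is stored and scanned, which is the cost and processing-time benefit the scenario describes.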


Path Info

Level
Intermediate
Duration
30m
Published
May 24, 2024


Table of Contents

  1. Challenge

    Prepare the Environment

    1. Create an S3 Bucket
    2. Create a "parking_data" folder
    3. Download Parking-Ticket-2022 data (https://open.toronto.ca/dataset/parking-tickets/)
    4. Upload file: Parking_Tags_Data_2022.000.csv to the "parking_data" folder
  2. Challenge

    Create a Glue Crawler

    1. Create a Glue Crawler to crawl the "parking_data" folder in S3.
    2. Create a database named "parking" in AWS Glue.
    3. Create and run an on-demand crawler that will crawl the "parking_data" S3 folder.
    4. Create a new policy named "AWSGlueServiceRole-Parking"
  3. Challenge

    Create a Job in Glue

    1. Edit the AWSGlueServiceRole-Parking policy to allow for Read and Write access to AWS Glue.
    2. Create a Visual Job in AWS Glue
    3. Set the newly uploaded S3 file as the source
    4. Create a transformation that will drop all columns except "date_of_infraction", "infraction_code", and "infraction_description"
    5. Create a data target to move the data back to the "parking_data" folder in S3. Have AWS Glue create a new "results" table and add it to the catalog.
    6. Run the job and verify the file exists in S3.
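
The lab itself is done in the AWS console, but the same steps can be roughly scripted. The sketch below assumes boto3-style clients are passed in (for example `boto3.client("s3")` and `boto3.client("glue")`). The crawler name is hypothetical, and the bucket and job names are left for the caller to supply. It also assumes the Glue job was already created in the visual editor.

```python
def prepare_environment(s3, bucket, local_csv):
    """Challenge 1: create the bucket and upload the CSV under parking_data/."""
    # Outside us-east-1, create_bucket also needs a CreateBucketConfiguration.
    s3.create_bucket(Bucket=bucket)
    key = "parking_data/" + local_csv.split("/")[-1]
    s3.upload_file(local_csv, bucket, key)
    return key


def create_and_run_crawler(glue, bucket, role, crawler_name="parking-crawler"):
    """Challenge 2: create the database and an on-demand crawler, then start it."""
    glue.create_database(DatabaseInput={"Name": "parking"})
    # Omitting Schedule leaves the crawler on demand.
    glue.create_crawler(
        Name=crawler_name,  # hypothetical name, not given in the lab
        Role=role,
        DatabaseName="parking",
        Targets={"S3Targets": [{"Path": f"s3://{bucket}/parking_data/"}]},
    )
    glue.start_crawler(Name=crawler_name)
    return crawler_name


def run_job_and_list_output(glue, s3, bucket, job_name):
    """Challenge 3 (final step): start the existing job and list the S3 folder."""
    run_id = glue.start_job_run(JobName=job_name)["JobRunId"]
    # Single status check only; a real script would poll until the run finishes.
    run = glue.get_job_run(JobName=job_name, RunId=run_id)
    state = run["JobRun"]["JobRunState"]
    listing = s3.list_objects_v2(Bucket=bucket, Prefix="parking_data/")
    keys = [obj["Key"] for obj in listing.get("Contents", [])]
    return state, keys
```

Passing the clients in as arguments keeps the functions easy to exercise with stand-in objects before pointing them at a real account.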

The Cloud Content team comprises subject matter experts hyper-focused on services offered by the leading cloud vendors (AWS, GCP, and Azure), as well as cloud-related technologies such as Linux and DevOps. The team is thrilled to share their knowledge to help you build modern tech solutions from the ground up, secure and optimize your environments, and so much more!

What's a lab?

Hands-on Labs are real environments created by industry experts to help you learn. These environments help you gain knowledge and experience, practice without compromising your system, test without risk, destroy without fear, and let you learn from your mistakes. Hands-on Labs: practice your skills before delivering in the real world.

Provided environment for hands-on practice

We will provide the credentials and environment necessary for you to practice right within your browser.

Guided walkthrough

Follow along with the author’s guided walkthrough and build something new in your provided environment!

Did you know?

On average, you retain 75% more of your learning if you get time for practice.
