Create a Streaming Data Pipeline on GCP with Cloud Pub/Sub, Dataflow, and BigQuery

This lab simulates live highway sensor data and publishes it to a Cloud Pub/Sub topic. A Cloud Dataflow streaming pipeline subscribes to that topic, transforms the incoming sensor data, and inserts it into a BigQuery table. We then view the streaming inserts in BigQuery while they are in progress and run aggregate queries to draw some useful insights from the data.

Level: Advanced
Duration: 45m
Published: Aug 30, 2019

Table of Contents

  1. Challenge

    Prepare Your Environment

    Enable the Pub/Sub and Dataflow APIs:

    gcloud services enable dataflow.googleapis.com
    gcloud services enable pubsub.googleapis.com
    

    Create a Cloud Storage bucket for Dataflow staging, named after the project ID:

    gsutil mb gs://$DEVSHELL_PROJECT_ID
    

    Download the GitHub repository used for lab resources:

    cd ~
    git clone https://github.com/ACloudGuru-Resources/googledataengineer
    
  2. Challenge

    Create a Pub/Sub Topic

    gcloud pubsub topics create sandiego
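
Pub/Sub identifies a topic by a fully qualified resource name of the form projects/PROJECT_ID/topics/TOPIC_ID, so the short name sandiego expands to a path under the current project. The helper below is a minimal plain-Python sketch of that naming rule; the function name and the project ID "my-lab-project" are placeholders for illustration, not values from the lab.

```python
def topic_path(project_id: str, topic_id: str) -> str:
    """Return the fully qualified Pub/Sub topic resource name."""
    return f"projects/{project_id}/topics/{topic_id}"


# "my-lab-project" is a placeholder; in Cloud Shell the real value
# would come from $DEVSHELL_PROJECT_ID.
print(topic_path("my-lab-project", "sandiego"))
```

Publishers and subscribers must agree on this full path, which is why the later steps pass the project ID to both the Dataflow script and the Python publisher.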
    
  3. Challenge

    Create a BigQuery Dataset to Stream Data Into

    Create a dataset named demos in the current project:

    bq mk --dataset $DEVSHELL_PROJECT_ID:demos
    

    The table will be named average_speeds. We do not create it ourselves; Dataflow creates it within the dataset when the streaming job runs.

  4. Challenge

    View the Dataflow Template

    Open the Java source of the template to review it. We will not interact with the template directly; instead, a script will install the Java environment and execute the template as a Dataflow job:

    vim googledataengineer/courses/streaming/process/sandiego/src/main/java/com/google/cloud/training/dataanalyst/sandiego/AverageSpeeds.java
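
To give a feel for the kind of transform a pipeline named AverageSpeeds might perform, here is a minimal plain-Python sketch of a windowed average. It is not the lab's Java template: the fixed 60-second window, the tuple layout, and the function name are assumptions made for illustration, and only the lane and speed fields are taken from the queries used later in this lab.

```python
from collections import defaultdict


def average_speeds(readings, window_seconds=60):
    """Average speed per (window start, lane).

    readings: iterable of (timestamp_seconds, lane, speed) tuples.
    The window size and tuple layout are illustrative assumptions.
    """
    sums = defaultdict(lambda: [0.0, 0])  # key -> [speed total, reading count]
    for timestamp, lane, speed in readings:
        # Assign each reading to the fixed window that contains it.
        window_start = int(timestamp // window_seconds) * window_seconds
        bucket = sums[(window_start, lane)]
        bucket[0] += speed
        bucket[1] += 1
    return {key: total / count for key, (total, count) in sorted(sums.items())}


# Made-up sample readings, not data from the lab.
sample = [
    (0, 1, 60.0),
    (30, 1, 70.0),
    (45, 2, 50.0),
    (61, 1, 80.0),
]
for (window_start, lane), avg in average_speeds(sample).items():
    print(window_start, lane, avg)
```

A real streaming pipeline also has to handle late and out-of-order data, which this in-memory sketch ignores.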
    
  5. Challenge

    Create the Dataflow Streaming Job

    Go to the Dataflow job script directory:

    cd ~/googledataengineer/courses/streaming/process/sandiego 
    

    Execute the script that creates the Dataflow streaming job and subscribes it to the Pub/Sub topic.

    This script passes along the Project ID, staging bucket (also the Project ID), and the name of the Java template to use:

    ./run_oncloud.sh $DEVSHELL_PROJECT_ID $DEVSHELL_PROJECT_ID AverageSpeeds
    

    When the script completes, the streaming job is subscribed to our Pub/Sub topic and waits for streaming input from our simulated sensor data.

  6. Challenge

    Publish Simulated Traffic Sensor Data to Pub/Sub via a Python Script and Pre-Created Dataset

    Go to the Python script directory:

    cd ~/googledataengineer/courses/streaming/publish
    

    Install any requirements for the Python script:

    pip install -U google-cloud-pubsub
    

    Download the simulated sensor data:

    gsutil cp gs://acg-gcloud-course-resources/sandiego/sensor_obs2008.csv.gz .
    

    Execute the Python script to publish simulated streaming data to Pub/Sub:

    ./send_sensor_data.py --speedFactor=60 --project=$DEVSHELL_PROJECT_ID
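
The lab does not spell out what --speedFactor does. A common way to replay recorded data faster than real time is to divide the gap between consecutive recorded timestamps by a factor, so a value of 60 would replay one recorded hour in about one minute. The sketch below illustrates that idea only; the function name and the sample timestamps are hypothetical, and the actual script should be checked for its exact behavior.

```python
from datetime import datetime


def replay_delay(previous: datetime, current: datetime, speed_factor: float) -> float:
    """Seconds to wait before sending `current`, given the prior event time."""
    elapsed = (current - previous).total_seconds()
    # Out-of-order events are sent immediately rather than with a negative delay.
    return max(elapsed, 0.0) / speed_factor


# Illustrative timestamps one recorded hour apart (not taken from the dataset).
first = datetime(2008, 11, 1, 0, 0, 0)
second = datetime(2008, 11, 1, 1, 0, 0)
print(replay_delay(first, second, speed_factor=60))  # 3600 s / 60 = 60.0
```

Under this assumption, raising the factor shortens the lab run, at the cost of pushing more messages per second through the pipeline.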
    
  7. Challenge

    View the Streamed Data in BigQuery

    In BigQuery, execute the following query to view the current streamed data, both in the table and in the streaming buffer:

    SELECT *
    FROM `demos.average_speeds` LIMIT 1000
    

    Note the total record count at the bottom of the results. Wait about a minute and run the same query again (be sure to uncheck "Use cached results" in the query options); the count should have increased.

  8. Challenge

    Use Aggregated Queries to Gain Insights

    Let's get some use out of this data. To forecast road maintenance, for example, we need to know which lanes carry the most traffic and will therefore need resurfacing first.

    Enter the following query to see which lanes have the most sensor readings:

    SELECT lane, count(lane) as total
    FROM `demos.average_speeds`
    GROUP BY lane
    ORDER BY total DESC
    

    We can also view which lanes have the highest average speeds:

    SELECT lane, avg(speed) as average_speed
    FROM `demos.average_speeds`
    GROUP BY lane
    ORDER BY average_speed DESC
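
To make explicit what the two aggregate queries above compute, here is a minimal plain-Python sketch of the same GROUP BY logic over a handful of in-memory rows. The rows are made up for illustration, the function name is hypothetical, and the sketch assumes no NULL values in lane or speed.

```python
from collections import defaultdict


def lane_stats(rows):
    """Return {lane: (reading count, average speed)} for rows with lane and speed."""
    totals = defaultdict(lambda: [0, 0.0])  # lane -> [count, speed total]
    for row in rows:
        entry = totals[row["lane"]]
        entry[0] += 1
        entry[1] += row["speed"]
    return {lane: (count, total / count) for lane, (count, total) in totals.items()}


# Made-up rows standing in for demos.average_speeds.
rows = [
    {"lane": 1, "speed": 60.0},
    {"lane": 1, "speed": 70.0},
    {"lane": 2, "speed": 80.0},
]
stats = lane_stats(rows)

# Equivalent of ORDER BY total DESC and ORDER BY average_speed DESC.
by_count = sorted(stats, key=lambda lane: stats[lane][0], reverse=True)
by_speed = sorted(stats, key=lambda lane: stats[lane][1], reverse=True)
print(by_count)  # [1, 2]
print(by_speed)  # [2, 1]
```

In BigQuery the same grouping runs over the table and its streaming buffer, so the results shift as new rows arrive.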
    

The Cloud Content team comprises subject matter experts hyper focused on services offered by the leading cloud vendors (AWS, GCP, and Azure), as well as cloud-related technologies such as Linux and DevOps. The team is thrilled to share their knowledge to help you build modern tech solutions from the ground up, secure and optimize your environments, and so much more!
