Understanding Apache Spark
Sep 4, 2020 • 8 Minute Read
Introduction
How efficiently a company or organization handles its data can make an enormous difference to the success of its mission. As a data professional today, odds are that you will be asked to manipulate, or uncover insights from, very large datasets. So how can you analyze these massive datasets efficiently and elegantly? How can you stream this data across multiple programming languages and environments?
The answer is Spark.
Put simply, Spark is an engine that analyzes data in a distributed fashion. Spark really shines when you are attempting to stream or run analytics on very large datasets.
This guide will give you a high-level overview of what Spark is and does. Spark has a very robust, general-purpose API, so covering it all in depth is beyond the scope of this guide. Instead, we will focus on what Spark provides and how it is structured.
Let's dive in!
Spark Core: General Architecture
First, let's learn the language of Spark. Here is a breakdown of the general terms that Spark uses in its architecture:
- Master Node: The master Spark process, which receives user-submitted Spark applications and communicates with the cluster manager.
- Application: A user-written program, in Scala, Python, Java, or another supported language, that uses one of the Spark APIs available for that language's ecosystem.
- Cluster Manager: The behind-the-scenes component that manages Spark's resources and distributes the work of submitted applications across the worker nodes.
- Worker Node: A worker node is responsible for carrying out a given piece of computation for a given application.
Spark Core is just that: the core of the analytics engine. It comprises the base Spark API for submitting Spark applications to a running instance of Spark within your environment. Spark Core itself is broken up into the four major pieces noted above. The first of these is the Spark application. A Spark application is something that you write yourself using any one of the available Spark APIs. For example, you may use Spark's Scala API to write code that connects to Spark, queries an available dataset, performs computations, and then spits out a result. Spark applications are submitted to the master node.
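To make that concrete, here is a minimal sketch of such an application written against the Scala API. The input path and column names are made up for illustration, and the master URL is assumed to be supplied when the application is submitted.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.avg

object SalesReport {
  def main(args: Array[String]): Unit = {
    // Connect to Spark (the master URL is typically supplied at submit time).
    val spark = SparkSession.builder().appName("SalesReport").getOrCreate()

    // Query an available dataset; the path and columns here are hypothetical.
    val sales = spark.read.option("header", "true").csv("hdfs:///data/sales.csv")

    // Perform a computation and print the result.
    sales.groupBy("region").agg(avg("amount").as("avg_amount")).show()

    spark.stop()
  }
}
```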
The second architectural piece is the master node. This is the process that exposes the interface through which you submit your Spark applications. The master node hands applications off to worker nodes via the cluster manager, a piece of the Spark architecture that you don't interface with directly. The cluster manager is responsible, among other things, for distributing work to the worker nodes efficiently.
Finally, the worker nodes are just that: nodes that work! These are running processes that are configured with a certain amount of system resources and are then given work to do by the cluster manager.
Altogether, Spark uses these pieces to perform extremely fast, failsafe computations in a distributed manner.
Spark Workflow Overview: From Start To Finish
Let's dive into the general workflow of Spark running in a clustered environment. Typically, the first thing you will do is download Spark and start up the master node on your system. You can start a standalone master node by running the following command from Spark's `sbin` directory: `./start-master.sh`. After the master node has started, make a note of the master URL (of the form `spark://<host>:7077`), which is written to the master's log and displayed at the top of the master's web UI. You will use this URL to start up your worker nodes.
With your master node started, you will next need to start one or more worker nodes. Remember, these are the nodes that will be performing the actual work! You can start a worker node by running `./start-slave.sh <master-node-URL>`. Now that you have started a worker, you can go ahead and check out the Spark UI, which is located at http://localhost:8080 by default. This UI is extremely helpful, as it gives you a window through which you can easily monitor running applications, workers, and available resources.
Well done. Spark is running! Next, you will need to implement the logic of your Spark application and submit it to the master node using the spark-submit tool that ships in Spark's `bin` directory. This application can be written against the Java, Scala, R, or Python Spark APIs.
Implementing a Spark application is beyond the scope of this guide, but you can find more information on how to write one in the official Spark documentation. Once your application has been implemented, you will submit it to the Spark master node and can then monitor it from the Spark UI.
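As a rough sketch of how these pieces fit together, the example below shows a trivial application that could be packaged and handed to spark-submit. The class name, jar name, and master host are placeholders rather than values from any real setup.

```scala
import org.apache.spark.sql.SparkSession

object HelloSpark {
  def main(args: Array[String]): Unit = {
    // The master URL can be hard-coded here, but it is more commonly passed
    // to spark-submit via its --master flag.
    val spark = SparkSession.builder().appName("HelloSpark").getOrCreate()

    // A trivial distributed computation so there is something to watch in the UI.
    val evens = spark.range(0, 1000000).filter("id % 2 = 0").count()
    println(s"Counted $evens even numbers")

    spark.stop()
  }
}

// Submitted with something like (placeholder jar name and host):
//   ./bin/spark-submit --master spark://<master-host>:7077 \
//     --class HelloSpark hello-spark.jar
```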
Spark Libraries: An Overview
Spark currently ships with four libraries that tackle a number of more specialized problem sets that also call for a distributed processing solution. These four libraries are:
- Spark Streaming
- Spark SQL
- MLlib
- GraphX
Spark Streaming is the library Spark provides for processing data streams. In addition to letting you create your own custom data sources, Spark Streaming can ingest data from HDFS (Hadoop Distributed File System), Flume, Kafka, Twitter, and ZeroMQ. The real power of Spark Streaming lies in its ability to combine batch processing and analytics with streaming data; before Spark Streaming, it was often necessary to stitch together different technologies to cover these capabilities. Spark Streaming allows you to apply batch-style processing to streaming data as well as run MLlib and Spark SQL workloads on it.
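For a flavor of what that looks like, here is a minimal Spark Streaming sketch that counts words arriving on a local network socket. The host, port, and batch interval are arbitrary choices for the example, and the local master is used only to keep the sketch self-contained.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object SocketWordCount {
  def main(args: Array[String]): Unit = {
    // Two local threads: one to receive the stream, one to process it.
    val conf = new SparkConf().setAppName("SocketWordCount").setMaster("local[2]")
    val ssc = new StreamingContext(conf, Seconds(10)) // 10-second micro-batches

    // Treat text arriving on localhost:9999 as a stream of lines.
    val lines = ssc.socketTextStream("localhost", 9999)
    val counts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
    counts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```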
Where Spark Streaming provides an API for processing streaming data, Spark SQL is the Spark API for processing structured data. Spark SQL is available to use in Scala, Java, Python, and R and is what you want to use if you have any sort of structured data that you need to analyze. A good use case for this would be the need to analyze terabytes of JSON data that is being streamed into your system. Remember, you can use this API in tandem with the Spark Streaming API to perform computation on structured data that is being streamed in.
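Staying with that JSON scenario, a sketch of the Spark SQL side might look like the following. The input path and column names are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object JsonAnalysis {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("JsonAnalysis").getOrCreate()

    // Spark SQL infers the schema of the JSON records automatically.
    val events = spark.read.json("hdfs:///data/events/*.json")

    // Register the DataFrame as a temporary view so it can be queried with SQL.
    events.createOrReplaceTempView("events")
    spark.sql(
      """SELECT userId, COUNT(*) AS event_count
        |FROM events
        |GROUP BY userId
        |ORDER BY event_count DESC
        |LIMIT 10""".stripMargin
    ).show()

    spark.stop()
  }
}
```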
MLlib is Spark's built-in machine learning library. It allows you to perform machine learning on structured and unstructured data through the available Spark APIs, and it includes classes for most major classification and regression algorithms, among other things. For more information, check out the MLlib documentation.
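For a sense of the API's shape, here is a short sketch that uses MLlib's DataFrame-based classes to fit a logistic regression model. The input path, feature columns, and label column are all made up for the example.

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession

object ChurnModel {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ChurnModel").getOrCreate()

    // Hypothetical training data with a numeric "label" column and two feature columns.
    val training = spark.read
      .option("header", "true")
      .option("inferSchema", "true")
      .csv("hdfs:///data/churn.csv")

    // MLlib estimators expect the features packed into a single vector column.
    val assembler = new VectorAssembler()
      .setInputCols(Array("tenure", "monthly_charges"))
      .setOutputCol("features")

    val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)
    val model = lr.fit(assembler.transform(training))
    println(s"Model coefficients: ${model.coefficients}")

    spark.stop()
  }
}
```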
GraphX is Spark's API for performing computation and analysis on graph datasets and data sources. GraphX extends Spark's RDD API with a distributed property-graph abstraction so that you can perform graph processing. A great use case for GraphX is running computations over a vast social network or any other dataset that is naturally modeled as a graph.
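As a small illustration, the sketch below builds a tiny follower graph and runs PageRank over it. The vertex and edge data are invented for the example.

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.sql.SparkSession

object FollowerRank {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("FollowerRank").getOrCreate()
    val sc = spark.sparkContext // GraphX is built on the RDD API

    // Tiny, made-up social graph: vertices are users, edges are "follows" relationships.
    val users = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
    val follows = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1)))
    val graph = Graph(users, follows)

    // Run PageRank until the ranks converge within the given tolerance.
    val ranks = graph.pageRank(0.0001).vertices
    ranks.join(users).collect().foreach {
      case (_, (rank, name)) => println(f"$name: $rank%.3f")
    }

    spark.stop()
  }
}
```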
Conclusion
In this guide, you gained a high-level overview of Apache Spark. You learned what Spark is and why it matters in today's data-centric world. You then gathered insights into the general architecture of Spark and how it is structured as an analytics engine. Finally, you learned about the different Spark libraries that are available and how they have grown an open-source ecosystem around Spark that is accessible from a wide variety of programming environments.
You can now both begin using Spark and, with confidence, understand where Spark and any of its libraries might fit within your own organization or project. For more information, check out the official Spark documentation.