
Big Data terminology


Jul 13, 2022 • 3 Minute Read

If you’re wondering about the differences between “data lakes” and “data warehouses” (or other Big Data terminology), you are not alone. In recent months, Pluralsight has received a growing number of queries on data-related topics. This brief overview provides a starting point for planning your Big Data training strategy.

What is Big Data?

Data is growing at a staggering pace. Check out Internet Live Stats, which displays real-time internet activity. Looking at those numbers, it is clear that traditional systems cannot cope with data at this scale.

Relational databases and mainframes handle only a limited amount of structured data, measured in gigabytes or small numbers of terabytes. In contrast, Big Data involves acquiring, storing, processing, analyzing, and gaining actionable insights from massive amounts of data—terabytes and beyond. Companies use Big Data to create a competitive advantage.

What’s driving the interest in Big Data?

Picture this: A business user wants to review the last six months of sales results for a specific region. She enters a query into the sales system, which then accesses the requested information from a data warehouse.

In order to get an answer…

  1. The requested information must be in the warehouse.
  2. The user must ask a question that the sales system recognizes.

But what if this same business user wants a piece of information that is not currently stored in the warehouse, or asks a question that the system does not recognize?

For example, what if the user wants to drill down and find the sales results for a particular store in the last six months? If that information is not in the data warehouse, the user would not receive an answer.
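To make the scenario concrete, here is a minimal sketch using SQLite as a stand-in warehouse. The table and column names are invented for illustration; they are not any particular sales system's schema. The regional question is answerable because the warehouse was built for it, while the store-level drill-down has no column to query:

```python
# Hypothetical warehouse table, aggregated by region only.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE regional_sales (region TEXT, month TEXT, total_sales REAL)"
)
conn.executemany(
    "INSERT INTO regional_sales VALUES (?, ?, ?)",
    [("Northeast", "2022-01", 120000.0), ("Northeast", "2022-02", 135000.0)],
)

# The question the sales system was built to answer:
# the last six months of results for a region.
for row in conn.execute(
    "SELECT region, SUM(total_sales) FROM regional_sales "
    "WHERE month >= '2022-01' GROUP BY region"
):
    print(row)

# The drill-down question cannot be answered: store-level detail was never
# loaded into the warehouse, so there is no store_id column to query.
# e.g. SELECT store_id, SUM(total_sales) FROM regional_sales GROUP BY store_id
```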

Before Big Data, making the changes to get this store-level sales information was not a simple undertaking. Implementing a change request like this could impact 50+ enterprise systems, requiring an intensive time and resource investment.

To remain competitive, organizations need fluid, nimble ways to access and analyze data. Companies cannot afford to wait months or more for answers to pressing questions.

Big Data terminology: What’s the difference between a data lake and a data warehouse?

A data warehouse stores business data in a structured way (e.g. relational database tables with well-defined structures/schema). Because warehousing is expensive, organizations limit what they store. They typically utilize a warehouse for very specific use cases, such as historical sales data for analysis and forecasting.
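As a rough sketch of what "structured" means in practice, the toy example below shows schema-on-write: records must match a predefined table definition before they can land in the warehouse. SQLite and the sales_history table are stand-ins chosen for illustration, not a real warehouse product or schema.

```python
import sqlite3

warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    """
    CREATE TABLE sales_history (
        sale_date TEXT NOT NULL,
        region    TEXT NOT NULL,
        revenue   REAL NOT NULL
    )
    """
)

# A record that matches the schema loads cleanly.
warehouse.execute(
    "INSERT INTO sales_history VALUES (?, ?, ?)", ("2022-06-30", "West", 98000.0)
)

# A record that violates the schema is rejected at write time.
try:
    warehouse.execute(
        "INSERT INTO sales_history VALUES (?, ?, ?)", ("2022-06-30", None, 87500.0)
    )
except sqlite3.IntegrityError as err:
    print("Rejected by the warehouse schema:", err)
```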

In contrast, a data lake can store all types of data at a centralized location. Format doesn’t matter. In addition to structured data, a data lake can accommodate semi-structured data (e.g. XML, sensor-based data) and unstructured data (e.g. images, videos, emails).

Organizations can build data lakes with inexpensive storage to store all enterprise data, as well as external data sources (industry information, social media, and so forth). A data lake enables business users and data analysts to ask an infinite number of questions without waiting for the data to be available via a change request process.
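By contrast, here is a minimal, hypothetical sketch of the data lake idea, using a local folder to stand in for cheap object storage. The paths, file names, and sample records are assumptions for illustration only: files of any format are dropped in as-is, and structure is applied later, when someone reads them back to answer a question.

```python
# "Store everything now, decide how to read it later."
import json
import pathlib

lake = pathlib.Path("data_lake")
(lake / "raw").mkdir(parents=True, exist_ok=True)

# Structured data: a CSV export from a transactional system.
(lake / "raw" / "store_sales.csv").write_text(
    "store_id,sale_date,revenue\n17,2022-06-30,4120.50\n"
)

# Semi-structured data: a JSON event from a sensor or clickstream.
(lake / "raw" / "sensor_events.json").write_text(
    json.dumps({"device": "door-7", "ts": "2022-06-30T09:15:00", "opened": True})
)

# Unstructured data: raw bytes such as an image or email attachment.
(lake / "raw" / "shelf_photo.jpg").write_bytes(b"\xff\xd8\xff\xe0 fake image bytes")

# No schema was enforced at write time; analysts impose structure on read.
for path in sorted((lake / "raw").iterdir()):
    print(path.name, path.stat().st_size, "bytes")
```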

Big Data terminology: The roles

If you’re new to Big Data, the job titles can be confusing. Three roles come up again and again: software engineers, data engineers, and data scientists.

These three roles have distinct training needs. While some consider data engineering a subdomain of software engineering, it requires mastery of a different set of skills and tools.

Also, the work of data scientists can vary widely. For example, data scientists may perform a one-off analysis for a team that wants a better understanding of customer behavior. Or they may develop machine learning algorithms that software engineers or data engineers integrate into the code base.

Bhavuk C.

Bhavuk Chawla teaches Big Data, Machine Learning, and Cloud Computing courses for DevelopIntelligence, a Pluralsight Company. He is also an official instructor for Google, Cloudera, and Confluent. For the past ten years, he’s helped implement AI, Big Data Analytics, and Data Engineering projects as a practitioner, utilizing the Cloudera/Hortonworks stack for Big Data, Apache Spark, Confluent Kafka, Google Cloud, Microsoft Azure, Snowflake, and more. He brings this hands-on experience, coupled with more than 25 Data/Cloud/Machine Learning certifications, to each course he teaches. Chawla has delivered knowledge-sharing sessions at Google Singapore, Starbucks Seattle, Adobe India, and many other Fortune 500 companies.
