Validating Data Using Asserts in R
Mar 2, 2020 • 13 Minute Read
Introduction
The quality of data plays a crucial role in machine learning. Without good data, errors creep in that adversely affect data analysis and model performance. Often, these errors are difficult to detect and surface late in the analysis. Worse still, some errors remain undetected and flow into the results, making them inaccurate. The solution to this problem lies in data validation. Enter asserts: debugging aids that test a condition and can be used to check data programmatically.
In this guide, you will learn to validate data using asserts in R. Specifically, we'll be using the assertr package, which provides a variety of functions designed to verify assumptions about data early in a data analysis pipeline.
Data
In this guide, we'll be using a fictitious dataset of loan applicants containing 600 observations and 10 variables, as described below:
- Marital_status: Whether the applicant is married ("Yes") or not ("No").
- Is_graduate: Whether the applicant is a graduate ("Yes") or not ("No").
- Income: Annual income of the applicant (in USD).
- Loan_amount: Loan amount (in USD) for which the application was submitted.
- Credit_score: Whether the applicant's credit score is satisfactory or not.
- approval_status: Whether the loan application was approved (1) or not (0).
- Age: The applicant's age in years.
- Sex: Whether the applicant is male ("M") or female ("F").
- Dependents: Number of dependents in the applicant's family.
- Purpose: Purpose of applying for the loan.
Let's start by loading the required libraries and the data.
library(readr)
library(assertr)
library(assertive)
library(magrittr)
library(dplyr)
dat <- read_csv("dataset.csv")
dim(dat)
Output:
[1] 600 10
Importance of Asserts
The example below demonstrates why asserts matter. We summarize the average age of the applicants, grouped by their approval status. The first line of code below converts the approval_status variable into a factor, while the remaining lines perform the required computation.
dat$approval_status <- as.factor(dat$approval_status)
dat %>%
  group_by(approval_status) %>%
  summarise(avg_age = mean(Age))
Output:
approval_status avg_age
<fctr> <dbl>
0 47.40000
1 48.61463
Nothing seems wrong with the above output, but let's look at the output of the summary() function for the Age variable.
summary(dat$Age)
Output:
Min. 1st Qu. Median Mean 3rd Qu. Max.
-10.00 36.00 50.00 48.23 61.00 76.00
From the output above, we can see that the minimum age is negative, which is not possible. This is incorrect data, but the error went undetected in the previous code, where we performed the group-by operation. This is where assertr's verify() function can be used to ensure that such mistakes don't go unidentified.
The verify() function takes a data frame (dat) and a logical expression (Age >= 0), and evaluates that expression against the provided data. If the expression is not true for every row, verify() raises an error and terminates further processing of the code pipeline. In this example, the lines of code below perform this task.
dat %>%
  verify(Age >= 0) %>%
  group_by(approval_status) %>%
  summarise(avg_age = mean(Age))
Output:
verification [Age >= 0] failed! (10 failures)
verb redux_fn predicate column index value
1 verify NA Age >= 0 NA 1 NA
2 verify NA Age >= 0 NA 2 NA
3 verify NA Age >= 0 NA 3 NA
4 verify NA Age >= 0 NA 4 NA
5 verify NA Age >= 0 NA 193 NA
6 verify NA Age >= 0 NA 194 NA
7 verify NA Age >= 0 NA 195 NA
8 verify NA Age >= 0 NA 199 NA
9 verify NA Age >= 0 NA 209 NA
10 verify NA Age >= 0 NA 600 NA
Error: assertr stopped execution
The output shows ten rows where the age is negative, identified by the index column. The value column is NA here because verify() reports only which rows failed the check. Finally, the message Error: assertr stopped execution shows that execution was halted, which is why the grouped summary was not displayed.
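To make the stop-or-continue behavior concrete, here is a minimal sketch on a hypothetical three-row data frame (the toy data and the variable names passed, outcome, and clean are illustrative assumptions, not part of the guide's dataset). It assumes the assertr package is installed and calls verify() directly rather than through a pipe.

```r
library(assertr)

# Hypothetical toy data with one impossible age.
toy <- data.frame(Age = c(25, -3, 40), approval_status = c(1, 0, 1))

# Passing check: verify() hands the data back, so a pipeline can continue.
clean <- toy[toy$Age >= 0, ]
passed <- verify(clean, Age >= 0)

# Failing check: verify() raises an ordinary R error, which tryCatch() can
# intercept when stopping the whole script is not what you want.
outcome <- tryCatch(
  {
    checked <- verify(toy, Age >= 0)
    mean(checked$Age)  # never reached, because verify() stops first
  },
  error = function(e) "validation failed"
)

nrow(passed)
outcome
```

In an interactive analysis, letting the error stop the script is usually the desired behavior; wrapping the call in tryCatch() is mainly useful in batch jobs that need to log the failure and move on.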
The same task can be performed using assertr's assert() function. In the code below, assert() takes the data, dat, and a predicate function, within_bounds(0, Inf). This range accepts only non-negative values (zero is included by default), and it can be altered as necessary. The predicate is then applied to the column of interest, Age. The code below raises an error when the condition is not met.
dat %>%
  assert(within_bounds(0, Inf), Age) %>%
  group_by(approval_status) %>%
  summarise(avg_age = mean(Age))
Output:
Column 'Age' violates assertion 'within_bounds(0, Inf)' 10 times
verb redux_fn predicate column index value
1 assert NA within_bounds(0, Inf) Age 1 -2
2 assert NA within_bounds(0, Inf) Age 2 -3
3 assert NA within_bounds(0, Inf) Age 3 -4
4 assert NA within_bounds(0, Inf) Age 4 -5
5 assert NA within_bounds(0, Inf) Age 193 -5
[omitted 5 rows]
Error: assertr stopped execution
The first line of the output, Column 'Age' violates assertion 'within_bounds(0, Inf)' 10 times, indicates that there are ten rows with negative age values.
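Because assert() applies a predicate to each value in a column, it can help to call the predicates on their own first. The sketch below assumes assertr is installed; the input vectors and the names non_negative, strictly_positive, and binary are hypothetical and chosen only for illustration.

```r
library(assertr)

# within_bounds() returns a predicate function; both bounds are inclusive
# by default, so zero passes this check.
non_negative <- within_bounds(0, Inf)
non_negative(c(-2, 0, 35))       # FALSE TRUE TRUE

# Excluding the lower bound turns it into a strictly-positive check.
strictly_positive <- within_bounds(0, Inf, include.lower = FALSE)
strictly_positive(c(-2, 0, 35))  # FALSE FALSE TRUE

# in_set() follows the same pattern for a fixed set of allowed values.
binary <- in_set(0, 1)
binary(c(0, 1, 2))               # TRUE TRUE FALSE
```

Testing a predicate this way is a quick check that the bounds mean what you intend before wiring it into a longer pipeline.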
Combining Several Asserts
It can be time-consuming and inefficient to validate data one condition at a time. A more efficient way is to use the family of assert functions and chain several checks together in a single pipeline, as shown in the example below.
Let's assume we want to validate the following conditions in our data.
- The data has all ten variables described in the initial section of the guide. This is achieved with the verify(has_all_names()) command in the code below.
- The dataset contains more than 120 observations, that is, more than twenty percent of the initial data. This is achieved with the verify(nrow(.) > 120) command below.
- The variable Age only takes positive values. This is achieved with the verify(Age > 0) command below.
- The variables Income and Loan_amount have values within three standard deviations of their respective means. This is achieved with the insist(within_n_sds(3), Income) and insist(within_n_sds(3), Loan_amount) commands in the code below.
- The target variable, approval_status, contains only the binary values zero and one. This is achieved with the assert(in_set(0,1), approval_status) command in the code below.
- Each row in the data contains at most six missing values. This is achieved with the assert_rows(num_row_NAs, within_bounds(0,6), everything()) command below.
- Each row is unique jointly between the Income, Dependents, approval_status, Age, Sex, Purpose, Loan_amount, and Credit_score variables. This is achieved with the assert_rows(col_concat, is_uniq,...) command below.
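The last two conditions in the list are row-level checks: each row is first reduced to a single value, and a predicate is then applied to that value. As a rough base-R analogy (not assertr's actual implementation), the sketch below computes the same two reductions by hand on a hypothetical toy data frame; the threshold of two missing values is also an assumption chosen to suit the small example.

```r
# Hypothetical toy data; column names mirror a few of the guide's variables.
toy <- data.frame(
  Income = c(50000, 50000, NA, 72000),
  Age    = c(34, 34, NA, 51),
  Sex    = c("M", "M", NA, "F"),
  stringsAsFactors = FALSE
)

# Row reduction 1: number of missing values in each row.
na_per_row <- rowSums(is.na(toy))

# Predicate: flag rows with more than two missing values.
rows_too_sparse <- which(na_per_row > 2)

# Row reduction 2: one key string per row, built from the chosen columns.
row_keys <- do.call(paste, c(toy[c("Income", "Age", "Sex")], sep = "|"))

# Predicate: flag every row whose key occurs more than once.
dup_rows <- which(duplicated(row_keys) | duplicated(row_keys, fromLast = TRUE))

unname(rows_too_sparse)  # row 3 is entirely missing
dup_rows                 # rows 1 and 2 share the same key
```

The assertr versions add the reporting and pipeline-stopping behavior on top of this reduce-then-test idea.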
dat %>%
  verify(has_all_names("Loan_amount", "Income", "Marital_status", "Dependents",
                       "Is_graduate", "Credit_score", "approval_status",
                       "Age", "Sex", "Purpose")) %>%
  verify(nrow(.) > 120) %>%
  verify(Age > 0) %>%
  insist(within_n_sds(3), Income) %>%
  insist(within_n_sds(3), Loan_amount) %>%
  assert(in_set(0, 1), approval_status) %>%
  assert_rows(num_row_NAs, within_bounds(0, 6), everything()) %>%
  assert_rows(col_concat, is_uniq, Income, Dependents, approval_status, Age, Sex, Purpose, Loan_amount, Credit_score) %>%
  group_by(approval_status) %>%
  summarise(avg.Age = mean(Age))
Output:
verification [Age > 0] failed! (10 failures)
verb redux_fn predicate column index value
1 verify NA Age > 0 NA 1 NA
2 verify NA Age > 0 NA 2 NA
3 verify NA Age > 0 NA 3 NA
4 verify NA Age > 0 NA 4 NA
5 verify NA Age > 0 NA 193 NA
6 verify NA Age > 0 NA 194 NA
7 verify NA Age > 0 NA 195 NA
8 verify NA Age > 0 NA 199 NA
9 verify NA Age > 0 NA 209 NA
10 verify NA Age > 0 NA 600 NA
Error: assertr stopped execution
The output shows that the first two requirements are met, but execution was halted at the third condition because the variable Age takes negative values. Let's make this correction and create a new data frame, dat2, which keeps only the rows with positive age values. This is done using the code below.
dat2 <- dat %>%
  filter(Age > 0)
dim(dat2)
Output:
[1] 590 10
The resulting data has 590 observations because ten rows containing negative values of age were removed. We'll recheck the combination of the data conditions, specified above, using the code below.
dat2 %>%
  verify(has_all_names("Loan_amount", "Income", "Marital_status", "Dependents",
                       "Is_graduate", "Credit_score", "approval_status",
                       "Age", "Sex", "Purpose")) %>%
  verify(nrow(.) > 120) %>%
  verify(Age > 0) %>%
  insist(within_n_sds(3), Income) %>%
  insist(within_n_sds(3), Loan_amount) %>%
  assert(in_set(0, 1), approval_status) %>%
  assert_rows(num_row_NAs, within_bounds(0, 6), everything()) %>%
  assert_rows(col_concat, is_uniq, Income, Dependents, approval_status, Age, Sex, Purpose, Loan_amount, Credit_score) %>%
  group_by(approval_status) %>%
  summarise(avg.Age = mean(Age))
Output:
Column 'Income' violates assertion 'within_n_sds(3)' 7 times
verb redux_fn predicate column index value
1 insist NA within_n_sds(3) Income 190 3173700
2 insist NA within_n_sds(3) Income 255 5219600
3 insist NA within_n_sds(3) Income 321 5333200
4 insist NA within_n_sds(3) Income 324 6901700
5 insist NA within_n_sds(3) Income 344 8444900
[omitted 2 rows]
The output shows that there is no longer an error for negative age values, since those rows were dropped. Instead, the insist() function found seven records where the Income variable was not within three standard deviations of the mean. The output also prints the index of these records, making it easier for us to inspect them and treat them as outliers. In this way, we can go on validating the data assumptions and incorporating corrections as needed.
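The three-standard-deviation rule can be worked through by hand to see what such a check measures. The sketch below uses only base R and a hypothetical income vector (19 ordinary values plus one extreme value); it illustrates the rule itself and is not a reproduction of assertr's internals.

```r
# Hypothetical incomes: 19 ordinary values and one extreme value.
income <- c(seq(40000, 58000, by = 1000), 5000000)

# The bounds are derived from the column itself, which is what separates
# this kind of check from a fixed range such as Age > 0.
centre <- mean(income)
spread <- sd(income)
lower <- centre - 3 * spread
upper <- centre + 3 * spread

# Indices of the values that fall outside the three-standard-deviation band.
outlier_idx <- which(income < lower | income > upper)
outlier_idx
```

Note that extreme values inflate the mean and standard deviation they are judged against, so this rule can miss outliers in small or heavily skewed samples. assertr also provides within_n_mads(), which is based on the median absolute deviation and may be a more robust choice in those cases.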
Conclusion
In this guide, you have learned methods of validating data using asserts in R. You have applied these assertions using functions from the assertr package: verify(), assert(), insist(), and assert_rows(). This knowledge will help you perform proper data validation, resulting in better data science and analytics results.
To learn more about Data Science with R, please refer to the following guides: