Coping with Missing, Invalid and Duplicate Data in R
Learn how to deal with missing, invalid and duplicate data in the statistical programming language R, to get your data ready for predictive modeling.
Oct 24, 2019 • 15 Minute Read
Introduction
A vital component of data science is cleaning the data and getting it ready for predictive modeling. The most common data cleaning problems are missing data, invalid records and duplicate values.
In this guide, you will learn about techniques for dealing with missing, invalid and duplicate data in the statistical programming language R.
Data
In this guide, we will be using a fictitious dataset of loan applications containing 600 observations and 12 variables:
1. UID - Unique identifier for an applicant
2. Marital_status - Whether the applicant is married ("Yes") or not ("No")
3. Dependents - Number of dependents of the applicant
4. Is_graduate - Whether the applicant is a graduate ("Yes") or not ("No")
5. Income - Annual income of the applicant (in USD)
6. Loan_amount - Loan amount (in USD) for which the application was submitted
7. Term_months - Tenure of the loan
8. Credit_score - Whether the applicant's credit score is good ("Satisfactory") or not ("Not Satisfactory")
9. approval_status - Whether the loan application was approved ("1") or not ("0")
10. Age - The applicant's age in years
11. Sex - Whether the applicant is a male ("M") or a female ("F")
12. Purpose - Purpose of applying for the loan
Let’s start by loading the required libraries and the data.
library(readr)
library(dplyr) # provides glimpse()
dat <- read_csv("data_cleaning.csv")
glimpse(dat)
Output:
Observations: 600
Variables: 12
$ UID <chr> "UIDA7", "UIDA354", "UIDA402", "UIDA467", "UIDA78", "U...
$ Marital_status <chr> "No", "Yes", "No", "No", "Yes", "No", "Yes", "Yes", "Y...
$ Dependents <int> NA, 0, 0, 0, NA, NA, NA, NA, 0, 2, 0, 2, 0, 0, NA, NA,...
$ Is_graduate <chr> "Yes", "Yes", "Yes", "Yes", "Yes", "No", "Yes", "No", ...
$ Income <int> 333300, 532400, 454700, 368500, 611100, 266700, 243700...
$ Loan_amount <int> 53500, 115000, 49000, 61000, 62000, 70000, 55500, 6300...
$ Term_months <int> 504, 384, 384, 384, 384, 384, 384, 204, 384, 384, 384,...
$ Credit_score <chr> "Satisfactory", "Satisfactory", "Satisfactory", "Satis...
$ approval_status <int> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, ...
$ Age <int> -3, -3, -10, -12, -12, -12, -12, 23, 23, 23, 23, 23, 2...
$ Sex <chr> "M", "M", "F", "F", "F", "M", "M", "M", "M", "M", "M",...
$ Purpose <chr> "Education", "Travel", "Personal", "Business", "Educat...
The output shows that the dataset has six numerical variables (labeled 'int') and six categorical variables (labeled 'chr'). We will convert the categorical variables, except 'UID', into 'factor' variables using the lines of code below.
factor_cols <- c(2, 4, 8, 9, 11, 12) # Marital_status, Is_graduate, Credit_score, approval_status, Sex, Purpose
dat[, factor_cols] <- lapply(dat[, factor_cols], factor)
glimpse(dat)
Output:
Observations: 600
Variables: 12
$ UID <chr> "UIDA7", "UIDA354", "UIDA402", "UIDA467", "UIDA78", "U...
$ Marital_status <fct> No, Yes, No, No, Yes, No, Yes, Yes, Yes, Yes, Yes, Yes...
$ Dependents <int> NA, 0, 0, 0, NA, NA, NA, NA, 0, 2, 0, 2, 0, 0, NA, NA,...
$ Is_graduate <fct> Yes, Yes, Yes, Yes, Yes, No, Yes, No, Yes, Yes, Yes, Y...
$ Income <int> 333300, 532400, 454700, 368500, 611100, 266700, 243700...
$ Loan_amount <int> 53500, 115000, 49000, 61000, 62000, 70000, 55500, 6300...
$ Term_months <int> 504, 384, 384, 384, 384, 384, 384, 204, 384, 384, 384,...
$ Credit_score <fct> Satisfactory, Satisfactory, Satisfactory, Satisfactory...
$ approval_status <fct> 0, 1, 1, 1, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, ...
$ Age <int> -3, -3, -10, -12, -12, -12, -12, 23, 23, 23, 23, 23, 2...
$ Sex <fct> M, M, F, F, F, M, M, M, M, M, M, M, M, F, F, M, M, M, ...
$ Purpose <fct> Education, Travel, Personal, Business, Education, Educ...
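Selecting columns by position works but breaks silently if the column order ever changes. A more robust alternative is to convert the same columns by name. Here is a minimal dplyr sketch using mutate_at(), the idiom current in the dplyr versions available when this guide was written:
cat_cols <- c("Marital_status", "Is_graduate", "Credit_score",
              "approval_status", "Sex", "Purpose")
dat <- mutate_at(dat, cat_cols, factor) # convert each named column to a factor
Both approaches produce the same factor columns; the name-based version is simply easier to audit.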
We are now ready to carry out the data cleaning steps in the following sections.
Duplicate Values
The first step is to check for duplicate records, one of the most common errors in real-world data. Duplicate records increase computation time and can bias the model, and hence must be removed. In our dataset, 'UID' is the unique identifier variable and will be used to drop the duplicate records. The first line of code below uses the duplicated() function to flag duplicates, while the second line counts them.
dup_records <- duplicated(dat$UID)
sum(dup_records)
Output:
[1] 3
The output shows that there are three duplicate records. We will drop them using the first line of code below. The second line prints the dimensions of the resulting dataset: 597 observations and 12 variables.
dat <- dat[!duplicated(dat$UID), ]
dim(dat)
Output:
[1] 597  12
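Our dataset has a unique identifier to deduplicate on, but many datasets do not. In that case, we can test entire rows for duplication instead. A minimal base R sketch (dplyr's distinct() function is an equivalent one-liner):
sum(duplicated(dat)) # count rows that are exact copies of an earlier row
dat <- dat[!duplicated(dat), ] # keep only the first occurrence of each row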
Invalid Values
When we looked at the data using the glimpse() function in the previous section, we saw that the 'Age' variable has invalid entries. Let's look at the summary of the variable.
summary(dat$Age)
Output:
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
-12.00 36.50 51.00 49.03 61.00 76.00 2
The output shows that the minimum value of the variable 'Age' is -12. This is impossible and brings us to the next common problem in real-world datasets: the presence of invalid records. It is safe to assume that, for loan applications, the minimum age should be 18 years. This means that we will remove the records of applicants below 18 years of age.
The first two lines of code below give us the number of records in the dataset for which the age is below 18 years. There are seven such records, and they are removed with the third line of code. The fourth line prints the dimensions of the new data: 590 observations and 12 variables.
Finally, we look again at the summary of the 'Age' variable. It shows that the range of age is now 23 to 76 years, indicating that the correction has been made.
age_18 <- dat[which(dat$Age<18),]
dim(age_18)
dat <- dat[-which(dat$Age<18),]
dim(dat)
summary(dat$Age)
Output:
[1] 590  12
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
23.00 37.00 51.00 49.72 61.00 76.00 2
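The same kind of range check is worth running on the other numeric variables to surface impossible values early. A minimal sketch, using the column names of our dataset:
sapply(dat[, c("Dependents", "Income", "Loan_amount", "Term_months", "Age")],
       range, na.rm = TRUE) # min and max of each numeric column, ignoring NAs
Any negative minimum, or a maximum far beyond plausible values, would warrant the same treatment we applied to 'Age'.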
It was easy to detect the invalid entries in the 'Age' variable. In other cases, invalid values take the form of outliers. Let's look at an example using the 'Income' variable. Its summary shows that the minimum and maximum income levels are USD 136,700 and USD 3,321,001, respectively. This is a highly skewed range, indicating some extreme values. To better understand the distribution, we use the quantile() function, which prints the percentiles of the variable from the 0th to the 100th in steps of one percentile.
summary(dat$Income)
quantile(dat$Income,seq(0,1,0.01))
Output:
Min. 1st Qu. Median Mean 3rd Qu. Max.
136700 386700 512800 687874 775300 3321001
0% 1% 2% 3% 4% 5% 6% 7% 8% 9%
136700 136700 210676 240000 244288 254540 263684 274664 286108 295100
10% 11% 12% 13% 14% 15% 16% 17% 18% 19%
301560 311100 317700 320000 329100 332220 333300 335288 344400 346700
20% 21% 22% 23% 24% 25% 26% 27% 28% 29%
352860 358800 363084 371396 383332 386700 391172 397980 401508 405556
30% 31% 32% 33% 34% 35% 36% 37% 38% 39%
410220 412724 421052 422244 424804 431960 437016 444400 448620 454972
40% 41% 42% 43% 44% 45% 46% 47% 48% 49%
458920 465068 468448 476468 479696 486180 491380 495548 500000 506956
50% 51% 52% 53% 54% 55% 56% 57% 58% 59%
512800 515552 523184 532012 536480 551820 555504 563080 572852 577700
60% 61% 62% 63% 64% 65% 66% 67% 68% 69%
585380 607584 611276 620300 625904 633300 648308 656708 666700 683156
70% 71% 72% 73% 74% 75% 76% 77% 78% 79%
700000 721040 733300 753968 761484 775300 788132 800000 807740 821696
80% 81% 82% 83% 84% 85% 86% 87% 88% 89%
834660 853300 880008 914712 963752 1010680 1058180 1111100 1149276 1219460
90% 91% 92% 93% 94% 95% 96% 97% 98% 99%
1262060 1333300 1392412 1502676 1664032 1944400 2064768 2223884 2608396 3197268
100%
3321001
We could remove these outliers the same way we removed the invalid age records, but deleting rows discards information. An alternative is to floor and cap the extreme values. The first line of code below floors the lower outliers at the first percentile value, i.e., USD 136,700. (In our data the first percentile equals the minimum, so the floor changes nothing here, but the pattern generalizes.) Similarly, the second line caps the higher outliers at the 99th percentile value, i.e., USD 3,197,268.
The third line of code prints the new summary of the variable, indicating that the correction has been made.
dat$Income[which(dat$Income < 136700)] <- 136700 # floor at the 1st percentile
dat$Income[which(dat$Income > 3197268)] <- 3197268 # cap at the 99th percentile
summary(dat$Income)
Output:
Min. 1st Qu. Median Mean 3rd Qu. Max.
136700 386175 508650 685301 772650 3197268
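Fixed percentile cutoffs are one convention; another common rule flags values lying more than 1.5 times the interquartile range (IQR) beyond the quartiles. A minimal sketch of this check (the 1.5 multiplier is a convention, not a law):
q <- quantile(dat$Income, probs = c(0.25, 0.75), na.rm = TRUE)
iqr <- q[2] - q[1] # interquartile range
lower <- q[1] - 1.5 * iqr
upper <- q[2] + 1.5 * iqr
sum(dat$Income < lower | dat$Income > upper) # how many values the rule flags
The flagged values could then be floored and capped at 'lower' and 'upper', exactly as we did with the percentile cutoffs above.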
Missing Values
Missing value treatment is one of the most common data cleaning steps in a data science project. The line of code below prints the number of missing values in each variable.
sapply(dat, function(x) sum(is.na(x)))
Output:
UID Marital_status Dependents Is_graduate Income
0 0 3 6 0
Loan_amount Term_months Credit_score approval_status Age
7 6 0 0 2
Sex Purpose
0 0
The output above shows missing values in five variables. All of them are numerical except 'Is_graduate', which is categorical.
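An equivalent, slightly more compact check in base R applies colSums() to the logical matrix returned by is.na():
colSums(is.na(dat)) # NA count per column, same result as the sapply() call above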
Missing Value Imputation for Numerical Variables
The most widely used technique for imputing a numerical variable is to replace its missing values with the mean or the median. In the lines of code below, we replace the missing values in 'Loan_amount' with the median, while the missing values in 'Term_months' are replaced by the mean. The output shows that the missing values have been imputed. (Note that mean imputation converts 'Term_months' from integer to double, which is visible in the final glimpse() output.)
dat$Loan_amount[is.na(dat$Loan_amount)] <- median(dat$Loan_amount, na.rm = TRUE)
table(is.na(dat$Loan_amount))
dat$Term_months[is.na(dat$Term_months)] <- mean(dat$Term_months, na.rm = TRUE)
table(is.na(dat$Term_months))
Output:
FALSE
590
FALSE
590
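When several numeric columns need the same treatment, the imputation can be done in one pass. A minimal sketch that replaces the NAs in every numeric column with that column's median (this assumes the median is a sensible choice for each of them, which should be verified per column):
num_cols <- sapply(dat, is.numeric) # logical vector flagging numeric columns
dat[num_cols] <- lapply(dat[num_cols], function(x) {
  x[is.na(x)] <- median(x, na.rm = TRUE) # fill NAs with the column median
  x
})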
Missing Value Imputation for Categorical Variables
For categorical variables, it is important to understand the frequency distribution, which can be printed with the line of code below.
table(dat$Is_graduate)
Output:
No Yes
127 457
The output shows that most applicants were graduates, identified with the label 'Yes'. The lines of code below replace the missing values with the most frequent label (the mode), 'Yes'.
dat$Is_graduate[is.na(dat$Is_graduate)] <- "Yes"
table(dat$Is_graduate)
Output:
No Yes
127 463
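Hardcoding 'Yes' works here, but the modal level can also be computed programmatically, which is safer if the data changes. A minimal sketch:
mode_level <- names(which.max(table(dat$Is_graduate))) # most frequent level
dat$Is_graduate[is.na(dat$Is_graduate)] <- mode_level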
Removing Rows with Missing Values
We have imputed missing values using measures of central tendency: the mean, the median and the mode. Another technique is to delete every row in which any variable is missing, which is exactly what the na.omit() function does.
dat <- na.omit(dat)
dim(dat)
Output:
[1] 585  12
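A closely related alternative is complete.cases(), which also lets us count the incomplete rows before deleting anything:
sum(!complete.cases(dat)) # rows still containing at least one NA (0 after na.omit)
dat <- dat[complete.cases(dat), ] # equivalent to na.omit(dat)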
The resulting data has 585 observations of 12 variables. We can do a final check to see if all the missing values have been removed using the command below.
sapply(dat, function(x) sum(is.na(x)))
Output:
UID Marital_status Dependents Is_graduate Income
0 0 0 0 0
Loan_amount Term_months Credit_score approval_status Age
0 0 0 0 0
Sex Purpose
0 0
We can look at the data after all these transformations using the glimpse() command below.
glimpse(dat)
Output:
Observations: 585
Variables: 12
$ UID <chr> "UIDA209", "UIDA238", "UIDA256", "UIDA274", "UIDA430",...
$ Marital_status <fct> Yes, Yes, Yes, Yes, No, Yes, No, Yes, Yes, No, No, Yes...
$ Dependents <int> 0, 2, 0, 2, 0, 0, 0, 1, 3, 0, 0, 1, 0, 0, 1, 0, 0, 0, ...
$ Is_graduate <fct> Yes, Yes, Yes, Yes, Yes, No, Yes, Yes, Yes, Yes, Yes, ...
$ Income <dbl> 1984000, 1066700, 834100, 775900, 421100, 402300, 1111...
$ Loan_amount <int> 1070000, 111000, 89000, 1330000, 49500, 56500, 104000,...
$ Term_months <dbl> 384, 384, 384, 384, 384, 384, 384, 384, 384, 384, 384,...
$ Credit_score <fct> Satisfactory, Satisfactory, Not _satisfactory, Satisfa...
$ approval_status <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, ...
$ Age <int> 23, 23, 23, 23, 23, 23, 24, 24, 24, 24, 25, 25, 25, 25...
$ Sex <fct> M, M, M, M, M, F, M, M, M, M, F, M, M, M, M, F, M, M, ...
$ Purpose <fct> Personal, Personal, Personal, Travel, Personal, Person...
Conclusion
In this guide, you have learned methods for dealing with missing, invalid and duplicate data in R. These techniques will help you clean your data and make it ready for machine learning.