Querying and Converting Data Types in R
Jul 3, 2020 • 11 Minute Read
Introduction
Working with data is a core requirement for data science professionals. The foundation is understanding the most common data types and knowing how to inspect, query, and convert them. In this guide, you will learn techniques for querying and converting data types in R.
Data Types
There are several data types in R, and the most integral ones are listed below:
- Characters: Text (or string) values are called characters. Assigning a text value to a variable, 't', will make it a character, as is shown below. You can confirm its type with the class() or typeof() function.
t = "pluralsight"
class(t)
typeof(t)
Output:
[1] "character"
[1] "character"
- Numerics: Decimal values like 3.5 are called numerics in R. Numeric is the default computational data type.
N = 3.5
class(N)
Output:
[1] "numeric"
The variable N is stored as a numeric value, and not an integer. This can be checked using the is.integer() function.
is.integer(N)
Output:
[1] FALSE
- Integers: To create an integer variable, use the as.integer() function, which drops the fractional part of a decimal value. All integers are numeric, but not all numerics are integers.
i = as.integer(3.1)
print(i)
Output:
[1] 3
- Logical: Logical values are often created by comparing two or more variables. These are denoted by boolean values, TRUE or FALSE.
x = 100
y = 56
x < y
Output:
[1] FALSE
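The checks above can be gathered into one short sketch. It uses only base R, and the variable names are illustrative rather than taken from the guide. It also shows the `L` suffix, which creates an integer directly, and a round trip between numeric and character.

```r
# Illustrative values; the names are assumptions, not part of the guide's data.
txt  <- "pluralsight"   # character
num  <- 3.5             # numeric (stored as a double)
int  <- 3L              # the L suffix creates an integer directly
flag <- num > int       # a comparison yields a logical

# class() reports the type of each object.
types <- c(class(txt), class(num), class(int), class(flag))
print(types)

# Every integer is numeric, but a decimal value is not an integer.
print(is.numeric(int))   # TRUE
print(is.integer(num))   # FALSE

# as.integer() truncates toward zero rather than rounding.
print(as.integer(3.9))   # 3
print(as.integer(-3.9))  # -3

# Converting between types: number to text and back again.
as_text <- as.character(num)
back    <- as.numeric(as_text)
print(identical(back, num))
```

Each type has a matching pair of functions: is.*() to test for it and as.*() to convert to it.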
The most common data types are discussed above. For data science work, however, the most important structure is the data frame.
Data Frame
The data frame is the de facto data structure for most data science projects because it organizes data in a tabular format. In simple terms, a data frame is a special type of list in which all the elements are of equal length.
Data frames are normally created by functions such as read_csv() and read.table() when importing data into R. You can also create a new data frame with the data.frame() function.
df <- data.frame(rollnum = 1:10, h1 = 15:24, h2 = 81:90)
df
Output:
   rollnum h1 h2
1        1 15 81
2        2 16 82
3        3 17 83
4        4 18 84
5        5 19 85
6        6 20 86
7        7 21 87
8        8 22 88
9        9 23 89
10      10 24 90
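The description of a data frame as a list of equal-length elements can be checked directly. The following is a minimal sketch in base R that rebuilds the small data frame and inspects it; the object names other than df are illustrative.

```r
# Rebuild the small data frame from the example above.
df <- data.frame(rollnum = 1:10, h1 = 15:24, h2 = 81:90)

# A data frame is a list whose elements are its columns.
print(is.list(df))

# Each column is a vector of the same length, equal to the row count.
col_lengths <- lengths(df)   # named vector, one length per column
print(col_lengths)
print(nrow(df))

# sapply() reports the class of each column.
col_classes <- sapply(df, class)
print(col_classes)

# Columns whose lengths do not match are rejected with an error.
bad <- try(data.frame(a = 1:3, b = 1:2), silent = TRUE)
print(inherits(bad, "try-error"))
```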
The most common way to obtain a data frame is to import a flat file, such as a CSV or Excel file, into the R environment. The code below loads the data that will be used in the subsequent sections with read_csv() from the readr package and previews it with glimpse() from the dplyr package.
library(readr)   # provides read_csv()
library(dplyr)   # provides glimpse()
dat <- read_csv("data.csv")
glimpse(dat)
Output:
Observations: 585
Variables: 6
$ UID <chr> "UIDA467", "UIDA402", "UIDA354", "UIDA209", "UIDA256",...
$ Income <dbl> 36850.4, 45470.2, 53240.2, 198400.2, 83410.2, 42110.2,...
$ Credit_score <chr> "Satisfactory", "Satisfactory", "Satisfactory", "Satis...
$ approval_status <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, ...
$ Age <int> -12, -10, -3, 23, 23, 23, 23, 23, 23, 24, 24, 24, 24, ...
$ Purpose <chr> "Business", "Personal", "Travel", "Personal", "Persona...
The output shows there are 585 observations of 6 variables, described below.
- UID: Unique identifier tag of the loan applicant.
- Income: Annual income of the applicant (in US dollars).
- Credit_score: Whether the applicant's credit score was satisfactory or not.
- approval_status: Whether the loan application was approved ("1") or not ("0").
- Age: The applicant's age in years.
- Purpose: The reason for the loan application.
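The file data.csv is not reproduced here, so the following sketch imports three made-up rows with the same column names to show how R infers a type for each column on import. It uses base R's read.csv() with the text argument instead of read_csv(), so it runs without any packages or files; the values are placeholders, not the guide's data.

```r
# Three made-up rows that mimic the column layout of data.csv;
# the values are placeholders, not the guide's actual data.
csv_text <- "UID,Income,Credit_score,approval_status,Age,Purpose
UIDX001,36000.5,Satisfactory,1,34,Business
UIDX002,52000.0,Not_satisfactory,0,45,Travel
UIDX003,41000.2,Satisfactory,1,29,Personal"

# read.csv() is the base R importer; text= parses a string instead of a file.
# stringsAsFactors = FALSE keeps text columns as character.
sample_dat <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Inspect the type that was inferred for each column.
import_classes <- sapply(sample_dat, class)
print(import_classes)
```

Text columns arrive as character and whole-number columns as integer, which is why categorical variables usually need an explicit conversion after import.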
Inspecting and Converting Data Types
For data science and machine learning, it's important that each variable has the right data type. To begin, use the str() function, which prints the structure of the data.
str(dat)
Output:
Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 585 obs. of 6 variables:
$ UID : chr "UIDA467" "UIDA402" "UIDA354" "UIDA209" ...
$ Income : num 36850 45470 53240 198400 83410 ...
$ Credit_score : chr "Satisfactory" "Satisfactory" "Satisfactory" "Satisfactory" ...
$ approval_status: int 1 1 1 1 1 1 1 1 1 1 ...
$ Age : int -12 -10 -3 23 23 23 23 23 23 24 ...
$ Purpose : chr "Business" "Personal" "Travel" "Personal" ...
From the output above, you can see that the data has six variables. Income and Age are numeric, and approval_status is coded with the numbers 0 and 1; UID, Credit_score, and Purpose hold text. Start by examining the distinct values of the text variables.
table(dat$Credit_score)
Output:
Not _satisfactory Satisfactory
124 461
The variable Credit_score has only two levels, so it can be converted to a factor variable with the as.factor() function.
dat$Credit_score = as.factor(dat$Credit_score)
class(dat$Credit_score)
Output:
[1] "factor"
Next, inspect the number of levels for the variable Purpose.
table(dat$Purpose)
Output:
Business Education Furniture Personal Travel Wedding
43 184 37 161 122 38
The variable Purpose has six levels, so it can also be converted to the factor data type with the code below.
dat$Purpose = as.factor(dat$Purpose)
class(dat$Purpose)
Output:
[1] "factor"
The last conversion to make is for the variable approval_status. Start by examining the class of the variable.
class(dat$approval_status)
table(dat$approval_status)
Output:
[1] "integer"
0 1
186 399
The class of approval_status is reported as integer, but the variable takes only two values, zero and one. It is really a categorical variable and should be converted to a factor.
dat$approval_status = as.factor(dat$approval_status)
class(dat$approval_status)
Output:
[1] "factor"
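The three conversions above follow the same pattern, so they can also be applied in a single step. This is a minimal sketch under stated assumptions: the loans data frame is a made-up stand-in for dat, and only its column names follow the guide.

```r
# Made-up stand-in for the loan data; only the column names follow the guide.
loans <- data.frame(
  Credit_score    = c("Satisfactory", "Not_satisfactory", "Satisfactory"),
  approval_status = c(1L, 0L, 1L),
  Purpose         = c("Business", "Travel", "Personal"),
  stringsAsFactors = FALSE
)

# Convert all three categorical columns in one step.
cat_cols <- c("Credit_score", "approval_status", "Purpose")
loans[cat_cols] <- lapply(loans[cat_cols], as.factor)

# Confirm the new classes and look at the levels of one column.
new_classes <- sapply(loans, class)
print(new_classes)
print(levels(loans$approval_status))
```

Listing the column names in one vector keeps the conversion easy to extend when more categorical variables are added.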
The required conversions have been made, and this can be verified with the code below.
glimpse(dat)
Output:
Observations: 585
Variables: 6
$ UID <chr> "UIDA467", "UIDA402", "UIDA354", "UIDA209", "UIDA256",...
$ Income <dbl> 36850.4, 45470.2, 53240.2, 198400.2, 83410.2, 42110.2,...
$ Credit_score <fct> Satisfactory, Satisfactory, Satisfactory, Satisfactory...
$ approval_status <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, ...
$ Age <int> -12, -10, -3, 23, 23, 23, 23, 23, 23, 24, 24, 24, 24, ...
$ Purpose <fct> Business, Personal, Travel, Personal, Personal, Person...
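Converting in the other direction, from a factor back to a number, has a well-known pitfall: a factor stores integer codes for its levels, and as.integer() returns those codes rather than the labels. The sketch below uses a hypothetical vector named status to show the difference.

```r
# Hypothetical factor with the same 0/1 coding as approval_status.
status <- factor(c(1, 1, 0, 1))
print(levels(status))   # "0" "1"

# as.integer() on a factor returns the internal level codes, not the labels.
codes <- as.integer(status)
print(codes)            # 2 2 1 2

# Going through as.character() first recovers the original numbers.
values <- as.integer(as.character(status))
print(values)           # 1 1 0 1
```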
You have inspected and converted the variables in the section above. Next, you will learn how to query some of the numerical variables. The summary() function provides key statistics about the variables.
summary(dat)
Output:
     UID               Income              Credit_score   approval_status
 Length:585         Min.   :  3000   Not _satisfactory:124   0:186
 Class :character   1st Qu.: 38890   Satisfactory     :461   1:399
 Mode  :character   Median : 51440
                    Mean   : 71655
                    3rd Qu.: 77570
                    Max.   :844490
      Age             Purpose
 Min.   :-12.00   Business : 43
 1st Qu.: 37.00   Education:184
 Median : 51.00   Furniture: 37
 Mean   : 49.39   Personal :161
 3rd Qu.: 61.00   Travel   :122
 Max.   : 76.00   Wedding  : 38
From the output above, you can see that the variable Age has negative values. This is incorrect data and needs further querying. There are various ways to do this, one of which is to find out how many such values there are.
neg_age = dat[dat$Age<0,]
nrow(neg_age)
Output:
[1] 3
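The same count can be obtained without building a subset. The sketch below uses a made-up vector named age in place of dat$Age and includes a missing value, because subsetting a data frame with a logical condition returns extra all-NA rows when the column contains NA.

```r
# Made-up ages, including a missing value, to stand in for dat$Age.
age <- c(-12L, -10L, -3L, 23L, NA, 41L)

# A comparison returns a logical vector; sum() counts the TRUE values.
n_negative <- sum(age < 0, na.rm = TRUE)
print(n_negative)

# which() gives the positions of the offending records and ignores NA.
neg_rows <- which(age < 0)
print(neg_rows)
```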
There are only three such records out of 585, so deleting them is unlikely to affect the analysis much. Another technique is to create a new logical variable that checks whether the age is negative.
The first line uses the ifelse() function to create a new variable, AgeNegative, that is TRUE when the age is negative and FALSE otherwise. The second line prints the first five values of the variable.
dat$AgeNegative <- ifelse(dat$Age < 0, TRUE, FALSE)
dat$AgeNegative[1:5]
Output:
[1]  TRUE  TRUE  TRUE FALSE FALSE
The output above shows that the first three values are TRUE, which corresponds to the three negative age values in the data. In a similar manner, you can inspect other variables in the data.
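Once the invalid records are flagged, the flag can be used either to mark the values as missing or to filter the rows out. The following sketch shows both options on a made-up data frame named loans; the AgeClean column and the valid object are hypothetical names introduced for illustration.

```r
# Made-up stand-in for the loan data; the values are placeholders.
loans <- data.frame(
  UID = c("UIDX001", "UIDX002", "UIDX003", "UIDX004"),
  Age = c(-12L, 23L, -3L, 51L)
)

# The comparison itself is already logical, so ifelse() is optional.
loans$AgeNegative <- loans$Age < 0

# Option 1: keep every row but mark the invalid ages as missing.
loans$AgeClean <- loans$Age
loans$AgeClean[loans$AgeNegative] <- NA

# Option 2: keep only the rows with a valid age.
valid <- loans[!loans$AgeNegative, ]

print(class(loans$AgeNegative))
print(loans$AgeClean)
print(nrow(valid))
```

Marking values as NA preserves the other columns of those records, while filtering is simpler when the affected rows are few.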
Conclusion
In this guide, you learned about the most common data types, and acquired the knowledge of querying and converting them. This will help you understand and transform data better to perform complex data science tasks.
To learn more about Data Science with R, please refer to the following guides: