Manipulating String Data in R
Most semi-structured and unstructured data is stored using strings, so you’ll need to deal with string manipulation for data analysis or mining.
Sep 6, 2019 • 10 Minute Read
Introduction
This guide will help you understand string manipulation in R. Most semi-structured and unstructured data is stored as strings, so you’ll need string manipulation for data analysis or mining. R provides built-in functions for converting case and for combining, measuring, and subsetting strings. The stringr package from the tidyverse is a popular choice because all of its string functions begin with str_ and are easy to remember; we will review some of these functions. Let us start by installing the tidyverse package.
install.packages("tidyverse")
library(tidyverse)
library(stringr)
Performing Simple String Operations
Define a String
To create strings in R, you can use single quotes, double quotes, or character(). However, character() creates a character vector of a given length rather than a single string, so you assign its elements afterwards.
myquote <- "Peace is a daily, a weekly, a monthly process, gradually changing opinions, slowly eroding old barriers, quietly building new structures"
myquote <- 'Peace is a daily, a weekly, a monthly process, gradually changing opinions, slowly eroding old barriers, quietly building new structures'
myquote = character(0)
myquote[1] = "Peace is a daily, a weekly, a monthly process, gradually changing opinions, slowly eroding old barriers, quietly building new structures"
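As a small sketch of the options above (the variable names `single`, `double`, `quoted`, and `slots` are only illustrative), single and double quotes produce identical strings, mixing them lets you embed one kind of quote inside the other, and character(n) gives a vector of n empty strings to fill in later:

```r
# Single and double quotes create the same kind of object.
single <- 'Peace is a daily process'
double <- "Peace is a daily process"
identical(single, double)   # TRUE
class(single)               # "character"

# Use one quote style outside to embed the other inside.
quoted <- 'He said "peace" twice'
quoted

# character(n) creates a vector of n empty strings to fill in later.
slots <- character(2)
slots[1] <- "first"
slots
```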
Create an Empty String
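Create an empty string when the value is not known yet and will be provided later. Note that '' and "" each create one string with zero characters, while character(0) creates a character vector with no elements. A separate variable is used here so that myquote keeps its value for the examples that follow.
emptyquote = character(0)
emptyquote <- ''
emptyquote <- ""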
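The difference between an empty vector and an empty string is easy to miss, so here is a minimal sketch using only base R (the variable names are illustrative):

```r
# character(0) is a vector with no elements at all.
no_elements <- character(0)
length(no_elements)   # 0

# "" is a vector holding one string that has zero characters.
empty_string <- ""
length(empty_string)  # 1
nchar(empty_string)   # 0

# A value can be assigned later.
no_elements[1] <- "filled in later"
length(no_elements)   # 1
```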
Display Length of String
String length needs to be checked for various purposes, such as:
- Comparing two strings
- Finding the longest or shortest string
- Applying a format to strings
Let us review length(), nchar(), and str_length() from stringr.
length()
> length(myquote)
Output:
`[1] 1`
length() returns the number of elements in a vector, not the number of characters. Since R stores myquote as a character vector with a single element, the result is 1.
nchar()
> nchar(myquote)
Output:
`[1] 136`
nchar() counts the characters in each string of the vector.
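To connect this with the purposes listed earlier, the following sketch uses nchar() on a small made-up vector (`words` is illustrative data, not from the guide) to compare strings and pick the longest and shortest ones:

```r
# nchar() is vectorized: it returns one count per element.
words <- c("daily", "weekly", "monthly")
counts <- nchar(words)
counts                      # 5 6 7

# Pick the longest and shortest strings by character count.
longest <- words[which.max(counts)]
shortest <- words[which.min(counts)]
longest                     # "monthly"
shortest                    # "daily"

# Compare the lengths of two strings.
nchar("daily") == nchar("month")   # TRUE
```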
str_length()
> str_length(myquote)
Output:
`[1] 136`
str_length() returns the number of code points in a string. Generally, one code point is one character, but not always.
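A case where code points and visible characters differ is a letter followed by a combining accent. The sketch below inspects the code points with base R's utf8ToInt() rather than stringr, so it only illustrates the idea; the strings are made-up examples:

```r
# "e" followed by a combining acute accent displays as one character
# but is stored as two code points.
accented <- "e\u0301"
code_points <- utf8ToInt(accented)
code_points                      # 101 769
length(code_points)              # 2

# The precomposed form of the same letter is a single code point.
precomposed <- "\u00e9"
length(utf8ToInt(precomposed))   # 1
```

Both strings look the same when printed, which is why a count of code points can be larger than the number of characters a reader sees.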
Combine Two Strings with c() and str_c()
At times, we need to add a string to an existing string. For example, the quote stored in myquote above does not contain a name or identifier. Let’s try to add this as a string.
Add a String Using the c() Combine Function
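> c(myquote, "-John F. Kennedy")
Output:
`[1] "Peace is a daily, a weekly, a monthly process, gradually changing opinions, slowly eroding old barriers, quietly building new structures"
[2] "-John F. Kennedy"`
c() does not join the text. It returns a character vector with two elements, each with its own character count. The result is printed here without being assigned, so myquote still holds the single original string in the examples that follow.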
Add a String Using the str_c() Combine Function
> str_c(myquote, "-John F. Kennedy", sep = "", collapse = NULL)
Output:
`[1] "Peace is a daily, a weekly, a monthly process, gradually changing opinions, slowly eroding old barriers, quietly building new structures-John F. Kennedy"`
You can use the sep argument to specify how the strings are separated, and collapse to join the resulting vector into a single string. Because str_c() is vectorized, it recycles a length-one input to the length of the longest input.
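The following sketch shows sep, recycling, and collapse together. It assumes stringr is installed; `parts` and `labelled` are illustrative names:

```r
library(stringr)

# sep is placed between the corresponding elements of each argument.
str_c("Peace", "process", sep = " - ")        # "Peace - process"

# A length-one input is recycled against the longer vector.
parts <- c("daily", "weekly", "monthly")
labelled <- str_c("a ", parts)
labelled            # "a daily" "a weekly" "a monthly"

# collapse then joins the resulting vector into a single string.
joined <- str_c(labelled, collapse = ", ")
joined              # "a daily, a weekly, a monthly"
```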
Subset a String
To extract parts of strings, you can use substr() or str_sub(). This is helpful when, for example, a date and time are stored together as one string and you need only the date part. Both functions take the start and end positions of the part to extract.
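The date-and-time case can be sketched as follows. The timestamp value and the variable names are made up for illustration, and the positions assume a fixed "YYYY-MM-DD HH:MM:SS" layout:

```r
# A timestamp stored as one string.
stamp <- "2019-09-06 14:30:00"

# Characters 1 to 10 hold the date part.
date_part <- substr(stamp, 1, 10)
date_part                      # "2019-09-06"

# Characters 12 to 19 hold the time part.
time_part <- substr(stamp, 12, 19)
time_part                      # "14:30:00"
```

With stringr loaded, str_sub(stamp, 1, 10) returns the same date part.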
> substr(myquote, 17, 45)
Output:
`[1] ", a weekly, a monthly process"`
> str_sub(myquote,start=17,end=45)
Output:
`[1] ", a weekly, a monthly process"`
To split a string into substrings wherever a given pattern matches:
> strsplit(myquote, "slowly")
Output:
`[[1]]
[1] "Peace is a daily, a weekly, a monthly process, gradually changing opinions, "
[2] " eroding old barriers, quietly building new structures"`
> str_split(myquote, "slowly")
Output:
`[[1]]
[1] "Peace is a daily, a weekly, a monthly process, gradually changing opinions, "
[2] " eroding old barriers, quietly building new structures"`
In this example, myquote is split at the word “slowly” into a character vector with two elements. Both functions return a list with one element per input string, which is why the output starts with [[1]].
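Because strsplit() returns a list, you usually index into it before working with the pieces. A minimal base R sketch, using a shortened version of the quote as illustrative input:

```r
sentence <- "slowly eroding old barriers, quietly building new structures"

# strsplit() returns a list with one element per input string.
pieces <- strsplit(sentence, ", ")
class(pieces)        # "list"

# Extract the character vector for the first (and only) input string.
parts <- pieces[[1]]
length(parts)        # 2

# Splitting on a single space yields individual words.
words <- strsplit(parts[1], " ")[[1]]
words                # "slowly" "eroding" "old" "barriers"
```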
Find and Replace Functions
To find a string, you can use the grep(), grepl(), regexpr(), gregexpr(), and regexec() functions. They differ in the format and detail of their results. To replace only the first match, use sub(); to replace all matches, use gsub().
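To make those differences concrete, here is a short base R sketch on a made-up vector (`phrases` is illustrative data):

```r
phrases <- c("daily process", "weekly process", "old barriers")

grep("process", phrases)                 # indices of matching elements: 1 2
grep("process", phrases, value = TRUE)   # the matching elements themselves
grepl("process", phrases)                # TRUE TRUE FALSE

# regexpr() gives the position of the first match in each element.
first_r <- as.integer(regexpr("r", phrases))
first_r                                  # 8 9 7

# sub() replaces only the first match; gsub() replaces every match.
sub("r", "R", "old barriers")            # "old baRriers"
gsub("r", "R", "old barriers")           # "old baRRieRs"
```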
In the example below, gsub() replaces all the spaces with “-”, and str_replace_all() then replaces all the “-” in that result with spaces.
> dashed <- gsub(" ", "-", myquote)
> dashed
Output:
`[1] "Peace-is-a-daily,-a-weekly,-a-monthly-process,-gradually-changing-opinions,-slowly-eroding-old-barriers,-quietly-building-new-structures"`
> str_replace_all(dashed, "-", " ")
Output:
`[1] "Peace is a daily, a weekly, a monthly process, gradually changing opinions, slowly eroding old barriers, quietly building new structures"`
Formatting Strings
Now we will discuss formatting. R provides C-style formatting through sprintf(), a wrapper for the C library function of the same name. Let us see an example using sprintf(), which replaces each format specifier with a given string or number. The specifiers used here are %s for a string and %.2f for a fixed-point value with two decimal places. You can find more information in the resources section.
> sprintf("Your device %s is at %.2f percent energy efficient", "Thermostat", 67.700)
Output:
`[1] "Your device Thermostat is at 67.70 percent energy efficient"`
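A few more specifiers, sketched with made-up values (`devices` and `efficiency` are illustrative): %d formats an integer, a width such as %5.1f pads the number, and %% prints a literal percent sign. sprintf() is also vectorized over its arguments.

```r
# %d for an integer, %5.1f for width 5 with 1 decimal, %% for a literal "%".
summary_line <- sprintf("%d devices at %5.1f%% efficiency", 3L, 67.7)
summary_line   # "3 devices at  67.7% efficiency"

# sprintf() is vectorized: one result per pair of inputs.
devices <- c("Thermostat", "Heater")
efficiency <- c(67.7, 81.25)
messages <- sprintf("%s: %.2f", devices, efficiency)
messages       # "Thermostat: 67.70" "Heater: 81.25"
```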
Pattern Matching
Let's review regular expressions, a method of describing patterns. For example, if I want to find all states starting with the letter “A” in the USArrests data set, I can set a pattern match as below:
#install rebus to specify anchors START and END
install.packages("rebus")
library(rebus)
# Find states starting with letter A
states = rownames(USArrests)
str_view(states, pattern = START %R% "A")
Similarly, to find all the states ending with “a”:
> str_view(states, pattern = "a" %R% END )
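The same two searches can be sketched in base R without rebus, on the assumption that START and END stand for the regular expression anchors ^ and $. Here grepl() is used to filter the vector instead of viewing the matches:

```r
states <- rownames(USArrests)

# ^ anchors the pattern to the start of the string.
starts_with_a <- states[grepl("^A", states)]
starts_with_a     # "Alabama" "Alaska" "Arizona" "Arkansas"

# $ anchors the pattern to the end of the string.
ends_with_a <- states[grepl("a$", states)]
head(ends_with_a)
```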
Conclusion
To conclude, this guide provides you with basic functions to get started on string manipulations. I have created a list of a few more functions that you can use; refer to the resources section for further explanations.
Check out the table below for base R functions:
Task | Function to use |
---|---|
Convert to uppercase | toupper(x) |
Convert to lowercase | tolower(x) |
Join multiple vectors | paste(..., sep = " ", collapse = NULL) |
Join elements of a vector together | paste(x, collapse = ' ') |
Find regular expression matches in x; returns a vector of the indices of the elements that contain the pattern | grep(pattern, x, ignore.case = FALSE, perl = FALSE, value = FALSE, fixed = FALSE, useBytes = FALSE, invert = FALSE) |
Find regular expression matches in x; returns TRUE for each element where the pattern is found | grepl(pattern, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE) |
Replace matches | gsub(pattern, replacement, x, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE) |
Convert an object to a character vector | as.character(x) |
Check whether an object is of type character | is.character(x) |
Abbreviate text | abbreviate(names.arg, minlength = 4, use.classes = TRUE, dot = FALSE, strict = FALSE, method = c("left.kept", "both.sides"), named = TRUE) |
Find the positions of all matches in each string | gregexpr(pattern, text, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE) |
Case folding | casefold(x, upper = FALSE) |
Character translation | chartr(old, new, x) |
Find the position of the first match and of its parenthesized sub-expressions; returns a list of the same length as text | regexec(pattern, text, ignore.case = FALSE, perl = FALSE, fixed = FALSE, useBytes = FALSE) |
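A few of the base R functions from the table, sketched on a made-up string (`x` is illustrative):

```r
x <- "Peace is a Process"

toupper(x)                         # "PEACE IS A PROCESS"
tolower(x)                         # "peace is a process"
casefold(x, upper = TRUE)          # same result as toupper(x)

# chartr() translates characters one for one: a -> 4, e -> 3.
chartr("ae", "43", x)              # "P34c3 is 4 Proc3ss"

# paste() with sep joins vectors element-wise; collapse joins one vector.
paste("a", c("daily", "weekly"), sep = " ")      # "a daily" "a weekly"
paste(c("daily", "weekly"), collapse = ", ")     # "daily, weekly"

is.character(x)                    # TRUE
as.character(136)                  # "136"
```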
Check out the table below for stringr functions and their usage:
Task | Function to use |
---|---|
Convert to uppercase | str_to_upper(string, locale = "en") |
Convert to lowercase | str_to_lower(string, locale = "en") |
Convert to title case | str_to_title(string, locale = "en") |
Convert to sentence case | str_to_sentence(string, locale = "en") |
View matches of a pattern | str_view(string, pattern, match = NA) or str_view_all(string, pattern, match = NA) |
Duplicate a string | str_dup(string, times) |
Remove white spaces | str_trim(string, side = c("both", "left", "right")) or str_squish(string) |
Wrap text | str_wrap(string, width = 80, indent = 0, exdent = 0) |
Count the matches of a pattern in each string | str_count(string, pattern = "") |
Specify the encoding of a string | str_conv(string, encoding) |
Order a character vector | str_sort(x, decreasing = FALSE, na_last = TRUE, locale = "en", numeric = FALSE, ...) |
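A few of the stringr functions from the table, sketched on a made-up string. This assumes stringr is installed; `x` and `tidy` are illustrative names:

```r
library(stringr)

x <- "  peace is   a process  "

str_trim(x)                 # "peace is   a process"
tidy <- str_squish(x)
tidy                        # "peace is a process"

str_to_upper(tidy)          # "PEACE IS A PROCESS"
str_to_title(tidy)          # "Peace Is A Process"
str_dup("ab", 3)            # "ababab"

# Count how often "ly" occurs in the string.
str_count("a daily, a weekly, a monthly process", "ly")   # 3
```

str_trim() only removes whitespace at the ends, while str_squish() also reduces repeated spaces inside the string to one.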
Check out my guides on visualizations with R: