Working with Formatted Text Files in R
May 4, 2020 • 8 Minute Read
Introduction
In this guide you will learn about the facilities R provides for working with formatted text files. Working with files is a very common task in R: data scientists often keep huge amounts of data on network shares or hard disks. Knowing how to access and process these files is crucial, because large inputs mean long script runtimes. The more efficiently you handle files, the more time your code saves. First we will clarify what is meant by a "formatted text file," then work with the interfaces available in R.
Common Text Files
These formats will be fairly familiar to you:
- TXT
- CSV
- JSON (JavaScript Object Notation)
- XML (Extensible Markup Language)
These are the most common formats for storing structured or unstructured data. When a file in one of these formats holds unstructured data, you need to make sense of it yourself: you must understand each piece of information in the file and adjust your code accordingly. The situation is much easier when the data is structured. Unstructured data and regular expressions go hand in hand; regular expressions let you parse out meaningful information while keeping resource consumption relatively low. For more information on regular expressions, check out this resource.
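As a minimal sketch of that idea, the snippet below pulls dates out of a few free-text lines using only base R. The input lines and the variable names are made up for illustration, and the pattern assumes dates are written as MM/DD/YYYY.

```r
# Made-up unstructured lines, for illustration only
lines <- c(
  "Report filed 01/06/2016, incident B16-00694",
  "No identifiers on this line",
  "Follow-up on 02/23/2016 for incident B16-06990"
)

# Assumed date layout: MM/DD/YYYY
date_pattern <- "[0-9]{2}/[0-9]{2}/[0-9]{4}"

# regexpr() locates the first match in each line;
# regmatches() extracts the matched text and skips lines without a match
dates <- regmatches(lines, regexpr(date_pattern, lines))

# grepl() reports which lines contain a date at all
has_date <- grepl(date_pattern, lines)

print(dates)
print(has_date)
```

Only the lines that match contribute to dates, so pairing each date with its source line needs has_date as well.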
Prerequisite
Base R can read plain text and CSV files on its own, but this guide relies on the readtext package, which you can add from R's versatile package repository. After firing up the R console, issue the following command:
install.packages("readtext")
This way, the latest stable version is installed on your system. If you would like to experiment with the newest version and its functionality, you can install it from GitHub. The following commands will do that for you:
install.packages("devtools")
devtools::install_github("quanteda/readtext")
Installing bleeding-edge packages from GitHub requires the devtools package, which is why it is installed first. The install_github() call takes the repository in owner/name form, here quanteda/readtext, and resolves it against https://github.com/. If you open https://github.com/quanteda/readtext in your browser, it will take you to the source files for the package.
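If you rerun your scripts often, you may not want to reinstall the package each time. The following is a small sketch of an install-only-if-missing check; install_if_missing is a hypothetical helper name, not part of R or readtext.

```r
# Hypothetical helper: install a CRAN package only when it is not yet available.
# Returns TRUE if an installation was attempted, FALSE otherwise.
install_if_missing <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
    return(TRUE)
  }
  FALSE
}

# "stats" ships with every R installation, so nothing is installed here
attempted <- install_if_missing("stats")
print(attempted)

# For this guide you would call:
# install_if_missing("readtext")
```

requireNamespace() checks whether the package can be loaded without attaching it, so the check has no side effects on your search path.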
Action
In this section you are going to use publicly available US data about stolen guns. It is enough to download a few of the CSV files and place them in the same folder. Then spin up the R console and load the readtext library.
library(readtext)
Next, set the DATA_DIR variable, which is going to be your working location. The readtext package ships with example files that are installed at the package's location.
DATA_DIR <- system.file("extdata/", package = "readtext")
On a Windows machine you should see a similar output:
[1] "C:/Program Files/R/R-3.6.3/library/readtext/extdata"
It has several subfolders like the following:
├───csv
├───json
├───pdf
│ └───UDHR
├───tsv
├───txt
│ ├───EU_manifestos
│ ├───movie_reviews
│ │ ├───neg
│ │ └───pos
│ └───UDHR
└───word
If you want to load the data from the word folder, the following needs to be done:
word_data <- readtext(paste0(DATA_DIR, "/word/*"))
Now word_data contains the following information:
readtext object consisting of 6 documents and 0 docvars.
# Description: df[,2] [6 x 2]
doc_id text
<chr> <chr>
1 21Parti_Socialiste_SUMMARY_2004.doc "\"[pic]\r\nRés\"..."
2 21vivant2004.doc "\"http://www\"..."
3 21VLD2004.doc "\"http://www\"..."
4 32_socialisti_democratici_italiani.doc "\"DIVENTARE \"..."
5 UK_2015_EccentricParty.docx "\"The Eccent\"..."
6 UK_2015_LoonyParty.docx "\"The Offici\"..."
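The wildcard pattern passed to readtext() above can also be built with file.path() instead of paste0(), which avoids tracking slashes by hand. This is only a sketch: data_dir is a stand-in that copies the example path printed earlier, so the snippet runs without readtext installed.

```r
# Stand-in for DATA_DIR, copied from the Windows output shown earlier
data_dir <- "C:/Program Files/R/R-3.6.3/library/readtext/extdata"

# file.path() joins its arguments with "/" as the separator
word_glob <- file.path(data_dir, "word", "*")
udhr_glob <- file.path(data_dir, "txt", "UDHR", "*")

print(word_glob)
print(udhr_glob)

# With readtext loaded, the pattern is used the same way as before:
# word_data <- readtext(word_glob)
```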
Now let's get back to the stolenguns folder. You can use a wildcard pattern as above to read the whole stolenguns folder at once, or read each file separately.
gun_data_q1 <- readtext("C:/Users/dszabo/Desktop/stolenguns/2016-first-quarter-stolen-guns.csv")
gun_data_q2 <- readtext("C:/Users/dszabo/Desktop/stolenguns/2016-second-quarter-stolen-guns.csv")
gun_data_q3 <- readtext("C:/Users/dszabo/Desktop/stolenguns/2016-third-quarter-stolen-guns.csv")
gun_data_q4 <- readtext("C:/Users/dszabo/Desktop/stolenguns/2016-fourth-quarter-stolen-guns.csv")
Each variable holds an object similar to the following, shown here for the first quarter.
readtext object consisting of 7 documents and 9 docvars.
# Description: df[,11] [7 x 11]
doc_id text Date Brand Model Color Stolen Stolen.From Status Incident.number Agency
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 2016-first-quar~ "\"P1382~ 01/06~ HI POI~ "9MM" "BLK" Stolen ~ Vehicle Recovere~ B16-00694 BPD
2 2016-first-quar~ "\"P1417~ 01/15~ JENNIN~ "" "COM" Stolen ~ Residence Not Reco~ B16-01892 BPD
3 2016-first-quar~ "\"P1437~ 01/24~ CENTUR~ "M92" "" Stolen ~ Residence Recovere~ B16-03125 BPD
4 2016-first-quar~ "\"P1470~ 02/08~ TAURUS "PT740~ "" Stolen ~ Residence Not Reco~ B16-05095 BPD
5 2016-first-quar~ "\"P1504~ 02/23~ HIGHPO~ "CARBI~ "" Stolen ~ Residence Recovere~ B16-06990 BPD
6 2016-first-quar~ "\"P1504~ 02/23~ RUGAR "" "" Stolen ~ Residence Recovere~ B16-06990 BPD
# ... with 1 more row
readtext can also attach document-level metadata (docvars) to the data frame it loads. You can take docvars from filenames, and it even allows you to name them individually. The dvsep argument defines the separator, given as a character string or regular expression, that is used to split each filename into docvars.
gun_data_q4 <- readtext("C:/Users/dszabo/Desktop/stolenguns/2016-fourth-quarter-stolen-guns.csv", docvarsfrom = "filenames", dvsep = "_", encoding = "ISO-8859-1")
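This should produce the following result. Because the filename contains no underscores, the whole name (without its extension) ends up in a single docvar1 column.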
doc_id text Date Brand Model Color Stolen Stolen.From Status Incident.number Agency docvar1
<chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
1 2016-fourth-quarter-stolen-guns.csv.1 "\"P22093\"..." 10/25/2016 SMITH AND WESSON "SD9VE" "" Stolen Locally Vehicle Not Recovered B16-42866 BPD 2016-fourth-quarter-stolen-guns
2 2016-fourth-quarter-stolen-guns.csv.2 "\"P22183\"..." 10/27/2016 TAURUS "PT111G2" "BLACK" Stolen Locally Residence Not Recovered B16-43134 BPD 2016-fourth-quarter-stolen-guns
3 2016-fourth-quarter-stolen-guns.csv.3 "\"P22497\"..." 11/07/2016 SIG SAUER "P290" "" Stolen Locally Vehicle Not Recovered B16-44838 BPD 2016-fourth-quarter-stolen-guns
4 2016-fourth-quarter-stolen-guns.csv.4 "\"P22910\"..." 11/18/2016 TAURUS "85UL" "SILVER" Stolen Locally Residence Not Recovered B16-46503 BPD 2016-fourth-quarter-stolen-guns
5 2016-fourth-quarter-stolen-guns.csv.5 "\"P23536\"..." 12/07/2016 SMITH & WESSON "" "" Stolen Locally Vehicle Not Recovered B16-48692 BPD 2016-fourth-quarter-stolen-guns
6 2016-fourth-quarter-stolen-guns.csv.6 "\"P23657\"..." 12/09/2016 COBRA ".380" "BLACK" Stolen Locally Residence Not Recovered B16-49060 BPD 2016-fourth-quarter-stolen-guns
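To see how the choice of separator shapes the filename-derived docvars, the sketch below splits the same filename in base R. It only mimics the idea and is not readtext's own implementation; split_stem is a hypothetical helper, and it treats the separator as a literal string rather than a regular expression.

```r
# Filename taken from the example above
path <- "C:/Users/dszabo/Desktop/stolenguns/2016-fourth-quarter-stolen-guns.csv"

# Drop the directory and the extension, keeping only the file stem
stem <- tools::file_path_sans_ext(basename(path))

# Hypothetical helper: split a file stem on a literal separator
split_stem <- function(stem, sep) {
  strsplit(stem, sep, fixed = TRUE)[[1]]
}

# No "_" in the stem, so it stays in one piece (compare docvar1 above)
parts_underscore <- split_stem(stem, "_")

# Splitting on "-" yields one piece per hyphen-separated token
parts_hyphen <- split_stem(stem, "-")

print(parts_underscore)
print(parts_hyphen)
```

By the same logic, passing a hyphen as dvsep to readtext() would presumably yield several docvars for these files instead of one, which is worth checking against your own filenames.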
How you approach using the readtext package depends heavily on the format your data actually takes.
Conclusion
In this guide we have seen what facilities R provides for working with common formatted text files. We covered the prerequisites and built a foundation for loading files, with and without document-level metadata, using readtext. I hope this guide has been informative to you and I would like to thank you for reading it.