Working with Compressed Files and Common Document Formats in R
May 4, 2020 • 5 Minute Read
Introduction
How can we process compressed files and common document formats, and the data inside them, with R? In this guide, we will look at which packages support this, then demonstrate how to process a compressed file and a common document format. First the concepts of compressed files and common document formats are explained, then R will help us make sense of the data hidden inside.
Compression
Most of the time when you use R for data mining or scientific purposes, you find at the very beginning that the data you are working with is huge, from a couple of hundred MBs to GBs or even TBs. Depending on your situation, you may not have enough storage on your computer to keep these files as they are.
Enter compression. Most of the time the numeric data you are working with has a pretty decent compression ratio, so you can save your ever-precious computing resources by simply storing the data compressed and letting R do the magic. You must make a clear distinction between the following two cases:
- You have computed data in data frames that you want to write to disk as a compressed file
- You have a bunch of smaller files compressed together in a single file and want to work with those
This guide will take a look at both cases.
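As a rough illustration of the second case, the sketch below bundles two small CSV files into a gzip-compressed tar archive and reads them back. The file names, the toy data, and the temporary folders are assumptions made up for this example; tar() and untar() come from base R's utils package.

```r
# Hypothetical toy data written to a temporary folder
src_dir <- file.path(tempdir(), "archive_src")
dir.create(src_dir, showWarnings = FALSE)
write.csv(data.frame(id = 1:3, value = c(0.5, 1.5, 2.5)),
          file.path(src_dir, "a.csv"), row.names = FALSE)
write.csv(data.frame(id = 4:5, value = c(3.5, 4.5)),
          file.path(src_dir, "b.csv"), row.names = FALSE)

# Bundle both files into one gzip-compressed tar archive.
# Working inside src_dir keeps the stored names short ("a.csv", "b.csv").
archive <- file.path(tempdir(), "demo.tar.gz")
old_wd <- setwd(src_dir)
tar(archive, files = c("a.csv", "b.csv"), compression = "gzip", tar = "internal")
setwd(old_wd)

# List the members without extracting, then unpack into a fresh folder
members <- untar(archive, list = TRUE, tar = "internal")
out_dir <- file.path(tempdir(), "archive_out")
dir.create(out_dir, showWarnings = FALSE)
untar(archive, exdir = out_dir, tar = "internal")

# Read every extracted file into a named list of data frames
tables <- lapply(file.path(out_dir, members), read.csv)
names(tables) <- members
print(members)
```

Listing the members first is a cheap way to check what an archive contains before unpacking something large.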
Common Document Format
Common document formats are going to be familiar to you; they are documents like PDF, Word, Excel, etc. R's ability to work with these types of documents makes it very powerful. In many cases the information you need to process is hidden in documents received from other departments, because they may not be able to provide the data you need in the most efficient format.
Reading Compressed Files
Since version 2.10, R can read the content of a compressed file and treat it as a text file. The file should be compressed with gzip, bzip2, or xz. You can visit sbeams to grab an example dataset and try this out. Once the example file is downloaded, you can spin up the R console and load it. Keep in mind that a .tar.gz file is a tar archive that may bundle several files; if the loaded result looks garbled, unpack the archive with untar() first and read the extracted file instead.
r <- read.table("C:/Users/dszabo/Downloads/External_test_data.tar.gz")
Depending on the size of your dataset, it may take some time to load. On Windows, use forward slashes (/) or escaped backslashes (\\) in the path you specify.
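To try transparent decompression without downloading anything, here is a minimal sketch that writes a small gzip-compressed CSV and reads it back. The data frame and the temporary file name are made up for illustration.

```r
# Hypothetical sample data
df <- data.frame(id = 1:5, score = c(2.5, 3.0, 4.25, 1.0, 5.5))

# Write the data through a gzip connection so the file on disk is compressed
gz_path <- tempfile(fileext = ".csv.gz")
con <- gzfile(gz_path, open = "w")
write.csv(df, con, row.names = FALSE)
close(con)

# Passing the path is enough: R detects the compression when reading
back <- read.csv(gz_path)

# An explicit connection works too; bzfile() and xzfile() are the
# counterparts for bzip2 and xz
con <- gzfile(gz_path, open = "r")
first_lines <- readLines(con, n = 2)
close(con)

print(back)
print(first_lines)
```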
Creating Compressed Files
There may be a situation when you have to pause your work but don't want to lose the progress you made, or you want to share the dataset on your PC with a co-worker. You have the option to export it and compress it without any hassle.
Let's create some dummy data.
X <- matrix(rnorm(1e8), ncol=10)
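The matrix itself holds 1e8 doubles (about 800MB), and R may briefly need roughly twice that while creating it, so expect memory use of about 1.5GB. If you want to, you can make it smaller by using 1e7. Now you can write this data to a file. You are able to use two functions here: write.table() and save().
write.table() does not compress the file on its own, so if you want to reduce the size you should compress the result afterwards with a tool like gzip, or write through a gzfile() connection.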
write.table(X, file = "C:/temp/progress.csv", sep = ",", row.names = FALSE, col.names = FALSE)
save(), on the other hand, is able to compress the file:
save(X, file = "C:/temp/progress.Rbin", compress = TRUE)
The size of the compressed file depends on the data you are working with. In this demo case the original size was 1.5GB and the compressed size became 0.75GB, which is a 50% compression ratio.
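Because the compression ratio depends on the data, it is worth measuring it on your own files. The sketch below assumes a much smaller matrix than the one above and temporary file names chosen for illustration; it writes the same data four ways and compares the sizes with file.size().

```r
set.seed(1)
X_small <- matrix(rnorm(1e5), ncol = 10)  # small stand-in for the big matrix

csv_path    <- tempfile(fileext = ".csv")
csv_gz_path <- tempfile(fileext = ".csv.gz")
raw_path    <- tempfile(fileext = ".RData")
rda_path    <- tempfile(fileext = ".RData")

# Plain text, uncompressed
write.table(X_small, file = csv_path, sep = ",",
            row.names = FALSE, col.names = FALSE)

# Same text written through a gzip connection
con <- gzfile(csv_gz_path, open = "w")
write.table(X_small, file = con, sep = ",",
            row.names = FALSE, col.names = FALSE)
close(con)

# Binary R format, without and with compression
save(X_small, file = raw_path, compress = FALSE)
save(X_small, file = rda_path, compress = TRUE)

sizes <- file.size(c(csv_path, csv_gz_path, raw_path, rda_path))
names(sizes) <- c("csv", "csv.gz", "save (raw)", "save (compressed)")
print(sizes)

# load() restores the object under its original name; a separate
# environment lets the restored copy be compared with the original
env <- new.env()
load(rda_path, envir = env)
print(identical(env$X_small, X_small))
```

Random numbers are close to a worst case for compression, so real measurement data with repeated or rounded values may shrink considerably more.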
Working with Common Documents
There are multiple packages that provide similar functionality, but the example in this guide will use the readtext package, which ships with example files.
In order to use this package it needs to be installed:
install.packages("readtext")
Once the package is installed, its sample files are available in the extdata folder inside the package directory of your R library. On Windows with a default installation of R 3.6.3, that is C:/Program Files/R/R-3.6.3/library/readtext/extdata. This folder contains sample PDF files that this guide will use for demonstration.
Load the readtext library:
library(readtext)
Now you need to initialize the DATA_DIR variable:
DATA_DIR <- system.file("extdata/", package = "readtext")
Printing DATA_DIR should produce output similar to the following:
[1] "C:/Program Files/R/R-3.6.3/library/readtext/extdata"
If you want to access and load the PDF files, run the following line:
pdf_data <- readtext(paste0(DATA_DIR, "/pdf/*.pdf"))
The pdf_data now holds the following information:
readtext object consisting of 1 document and 0 docvars.
# Description: df[,2] [1 x 2]
  doc_id                  text
  <chr>                   <chr>
1 UK_natl_2005_en_PVP.pdf "\" ⌧ Pro\"..."
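If you are unsure which sample files your installation contains, a small sketch like the following can list them before reading. It assumes readtext (and the PDF support it relies on) is installed, and skips quietly otherwise; the variable names are chosen for illustration only.

```r
pdf_files <- character(0)

if (requireNamespace("readtext", quietly = TRUE)) {
  # Locate the example folder that ships with the package
  data_dir <- system.file("extdata", package = "readtext")

  # Find every PDF below it, including those in subfolders
  pdf_files <- list.files(data_dir, pattern = "\\.pdf$",
                          recursive = TRUE, full.names = TRUE)
  print(basename(pdf_files))

  if (length(pdf_files) > 0) {
    # Read only the first PDF and peek at the start of its text
    one_doc <- readtext::readtext(pdf_files[1])
    print(nchar(one_doc$text))
    cat(substr(one_doc$text, 1, 200), "\n")
  }
}
```

Reading a single file first is a quick way to confirm that text extraction works before loading a whole folder with a wildcard.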
Conclusion
In this guide you learned how to work with compressed data, and create compressed data from your R console. You have also learned how to work with common document types. I hope this guide has been informative to you and I would like to thank you for reading it.