Visualising large data sets in R

November 18, 2018    R big data visualisation

This is a quick rundown of two methods of visualising large data sets in R. Exploratory data analysis (EDA) is extremely important for analysts, scientists, or anyone who is examining an unfamiliar data set. The default R plots are usually insufficient once the number of observations grows large enough for the points to overlap.

In the following example, I start by creating a data frame with two columns of 10,000,000 observations each.

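n = 1e7
# y1: normally distributed around 0
sdata = data.frame(y1 = rnorm(n, mean = 0, sd = 0.5))
# x2: about 30% of rows follow y1 plus noise, the rest follow -y1 plus noise centred on 3
sdata$x2 = ifelse(runif(n) > 0.7, sdata$y1 + rnorm(n, mean = 0, sd = 1), -sdata$y1 + rnorm(n, mean = 3, sd = 1))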
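
To make the shape of this simulated data easier to see before plotting it, here is a minimal sketch of the same recipe at a smaller size. The seed, the smaller sample size and the names n_small and in_first are my own assumptions for illustration; they are not part of the example above.

```r
# Sketch only: same recipe as above, on 100,000 rows instead of 10,000,000.
set.seed(1)                      # assumption: seed added for reproducibility
n_small <- 1e5                   # hypothetical smaller sample size

y1 <- rnorm(n_small, mean = 0, sd = 0.5)
in_first <- runif(n_small) > 0.7 # TRUE for roughly 30% of rows

x2 <- ifelse(in_first,
             y1 + rnorm(n_small, mean = 0, sd = 1),   # group centred on x2 = 0
             -y1 + rnorm(n_small, mean = 3, sd = 1))  # group centred on x2 = 3

mean(in_first)                   # share of rows in the first group
tapply(x2, in_first, mean)       # mean of x2 in each group
cor(x2[in_first], y1[in_first])     # positive slope in the first group
cor(x2[!in_first], y1[!in_first])   # negative slope in the second group
```

So a good plot of this data should reveal two overlapping clouds with opposite slopes, one of them holding more than twice as many points as the other.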

Drawing the following plot with the base R plot() function takes several minutes, even on a high-end machine. The result is neither useful nor particularly attractive. It shows us very little about the distribution or density of these observations: they appear to be equally dense everywhere.

plot(sdata$x2, sdata$y1, main = 'Example of default plot()', cex.axis = 1.5, cex.main = 1.5, cex.lab = 1.5, xlim = c(-4, 7), ylim = c(-2, 2), asp = 2)


The function smoothScatter() in base R takes only a couple of seconds to visualise these ten million observations. Instead of drawing every point, it draws a smoothed colour density of the scatter plot, so the dense regions stand out.

with(sdata, smoothScatter(x2, y1, main = 'Example of smoothScatter()', cex.axis = 1.5, cex.main = 1.5, cex.lab = 1.5, xlim = c(-4, 7), ylim = c(-2, 2), asp = 2))
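
If you want to check the speed difference on your own machine, a rough sketch with system.time() could look like the following. I am assuming a much smaller sample and a null PDF device here so that it finishes quickly and draws nothing on screen; the variable names are made up for the illustration, and the timings will vary from machine to machine.

```r
# Sketch only: compare plot() and smoothScatter() on a small sample.
set.seed(1)                      # assumption: seed added for reproducibility
n_small <- 1e5                   # hypothetical smaller sample size
x <- rnorm(n_small)
y <- rnorm(n_small)

pdf(NULL)                        # null device: render without writing a file
t_plot   <- system.time(plot(x, y))[["elapsed"]]
t_smooth <- system.time(smoothScatter(x, y))[["elapsed"]]
invisible(dev.off())

# Elapsed seconds for each approach
timings <- c(plot = t_plot, smoothScatter = t_smooth)
timings
```

With a sample this small, the gap will be much less dramatic than with ten million rows, and it also depends heavily on the graphics device.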


Another option is to use the package hexbin, which performs a very fast version of hexagonal binning. A basic default plot is already quite informative.

library(hexbin)
plot(hexbin(sdata$x2, sdata$y1, xbins = 200), legend = FALSE, main = 'Example of default hexbin()', xlab = 'x2', ylab = 'y1')
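
To see what hexagonal binning actually produces, it can help to look inside the object that hexbin() returns before plotting it. The sketch below is an illustration on simulated data, with a seed, sample size and bin count that I chose arbitrarily; it relies on the count and ncells slots of the hexbin object and on hcell2xy() from the same package.

```r
# Sketch only: inspect the result of hexagonal binning.
library(hexbin)

set.seed(1)                      # assumption: seed added for reproducibility
x <- rnorm(1e5)
y <- rnorm(1e5)

hb <- hexbin(x, y, xbins = 30)   # 30 hexagons across the x range

hb@ncells                        # number of non-empty hexagons
sum(hb@count) == length(x)       # every point falls in exactly one hexagon

# Centre coordinates of each non-empty hexagon, alongside its count
centres <- hcell2xy(hb)
head(data.frame(x = centres$x, y = centres$y, count = hb@count))
```

The plot only has to draw one hexagon per non-empty cell rather than one symbol per observation, which is why it stays fast as the number of rows grows.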


The hexbin package is extremely flexible, and can help you visualise your data with attractive colours. A simple custom colour map is implemented below.

cols = colorRampPalette(c("#fee6ce", "#fd8d3c", "#e6550d", "#a63603"))
plot(hexbin(sdata$x2, sdata$y1, xbins = 40), colorcut = seq(0, 1, length.out = 20), colramp = cols, legend = FALSE, main = 'Example of coloured hexbin()', xlab = 'x2', ylab = 'y1')
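
For anyone unfamiliar with colorRampPalette(), it does not return colours directly: it returns a function that interpolates as many colours as you ask for. The short sketch below shows that on its own, using the same four colours as above. The names breaks and pal are mine, and my understanding that the colour function is asked for one colour per class is an assumption worth checking against the hexbin documentation.

```r
# Sketch only: how the colour ramp and the class boundaries relate.
cols <- colorRampPalette(c("#fee6ce", "#fd8d3c", "#e6550d", "#a63603"))

cols(2)                          # the two end colours of the ramp

# 20 boundaries between 0 and 1 define 19 colour classes
breaks <- seq(0, 1, length.out = 20)
length(breaks) - 1

# One interpolated colour per class, from light to dark orange
pal <- cols(length(breaks) - 1)
pal
```

Using fewer boundaries gives coarser colour steps, which can make the densest regions easier to pick out.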


So those are two brief examples of how to quickly start examining large data sets in R. I hope this helps with your EDA!