Programming with Big Data in R - useR! 2013 Tutorial

Tutorial: Programming with Big Data in R

George Ostrouchov, Oak Ridge National Laboratory
Drew Schmidt, University of Tennessee

Overview

This tutorial will introduce attendees to High Performance Computing (HPC) concepts for dealing with big data using R. The content is particularly well suited to large distributed platforms, and it is also usable on small multicore platforms.

Parallel programming for distributed platforms is most naturally done with a Single Program/Multiple Data (SPMD) viewpoint. This programming model is used by the vast majority of the HPC community. A major focus of this tutorial will be introducing attendees to this viewpoint, and contrasting it with R's usual manager/worker viewpoint and map-reduce variants.
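As a rough illustration of the SPMD viewpoint, the sketch below has every process run the same script, work only on its own slice of the data, and then combine the partial results. It is a minimal sketch, not tutorial material: it assumes the pbdMPI package and an MPI launcher such as mpirun are installed, and the script name, problem size, and round-robin split are hypothetical choices made only for illustration.

```r
# Launch several copies of this same script, for example:
#   mpirun -np 4 Rscript spmd_sum.R      (file name is hypothetical)
library(pbdMPI)
init()

rank <- comm.rank()   # id of this process: 0, 1, ..., size - 1
size <- comm.size()   # total number of processes

# Every rank runs identical code but owns a different part of 1..n.
# Here the indices are dealt out round-robin (an illustrative choice).
n <- 1000
my_idx <- which((seq_len(n) - 1) %% size == rank)

# Step 1: each rank computes on its local data only.
local_sum <- sum(as.numeric(my_idx))

# Step 2: combine the partial sums; every rank receives the result.
total <- allreduce(local_sum, op = "sum")

# Printed by rank 0 only; the value equals n * (n + 1) / 2.
comm.print(total)

finalize()
```

There is no manager process handing out tasks here: the rank and size alone determine which data each copy of the program handles, which is the main contrast with the manager/worker style.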

In this tutorial we will:

Motivate the need for parallelism in all stages of big data processing
Discuss the pbdR system of packages which make R scalable
Introduce basic MPI programming concepts
Show how to handle big data with parallel processing, using "nearly native" R syntax
Work through numerous examples, ranging from linear regression to clustering

Goals

The tutorial aims to introduce the basics of parallel programming in the SPMD programming model using MPI and the pbdR system of packages.

Additionally, we hope to encourage package developers to instrument their packages with pbdR, so that more R analytics become scalable on large computational platforms. We also hope to learn about specific user needs that can guide our further development of pbdR.

Outline

Parallel programming and pbdR Basics
Introduction to MPI programming in R
Detailed pbdMPI examples
Introduction to distributed matrices with pbdDMAT
Detailed pbdDMAT examples

Prerequisites

We assume intermediate knowledge of R. No prior parallel programming experience is necessary. If you wish to follow along on your multicore laptop during the tutorial, please install (or check that you have):

R (and Rtools if you are a Windows user)
An MPI library
the pbdR packages

Please see our installation instructions for each major platform.

Intended Audience

The R programmer with an interest in parallel programming and a need to handle very large data.

Workshop Materials

Slides and source code for the tutorial will be made available by the first week of July 2013 on the pbdR website.

Thank you for registering to participate in the "Programming with Big Data in R" tutorial. The tutorial is structured so that you can follow along "lecture style" or you can engage with the examples "hands on." Here are a few suggestions that will allow you to get the most out of this tutorial.

We will have login tokens available for attendees to use the supercomputer Nautilus, a 1024-core SGI system at NICS. Use of this resource is optional, even for hands-on purposes, but all pbdR packages are installed there, which makes following along easier. You will need an ssh client to use this resource (such as PuTTY on Windows; Mac and Linux systems come with ssh by default). Note that this is our first tutorial use of this resource across the Atlantic. We expect things to go smoothly, but if they do not, your multicore laptop is the backup.
To get the most out of the hands-on portion of the tutorial, you will need to install the latest versions of our packages. Development versions are available from r-pbd.org. However, we intend to submit the stable versions to CRAN within a few days, and we expect the newest versions to be on CRAN by Monday, July 8th.
Instructions for installing pbdR packages are available from http://r-pbd.org/install.html. Should you need assistance with installing the packages, please contact us at RBigData@gmail.com.