Tutorial: Programming with Big Data in R
George Ostrouchov, Oak Ridge National Laboratory
Drew Schmidt, University of Tennessee
Overview
This tutorial will introduce attendees to High Performance Computing
(HPC) concepts for dealing with big data using R. The content is
particularly well suited for use on large distributed platforms and it
is also accessible from small multicore platforms.
Parallel programming for distributed platforms is most naturally done
with a Single Program/Multiple Data (SPMD) viewpoint. This programming
model is used by the vast majority of the HPC community. A major focus
of this tutorial will be introducing attendees to this viewpoint, and
contrasting it with R's usual manager/worker viewpoint and map-reduce
variants.
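As a minimal sketch of the SPMD viewpoint (not taken from the tutorial materials), the script below is one program that every MPI process runs in full; the processes differ only in their rank and in the slice of data they work on. It assumes the pbdMPI package and an MPI library are installed. The file name hello_spmd.r, the toy data, and the cyclic slicing scheme are illustrative choices of ours.

```r
# hello_spmd.r (hypothetical file name)
# Launch with, for example:  mpirun -np 4 Rscript hello_spmd.r
library(pbdMPI)
init()                      # start MPI; every process executes this same script

rank <- comm.rank()         # id of this process: 0, 1, ..., size - 1
size <- comm.size()         # total number of processes

# Toy data: each process takes every size-th element, offset by its rank.
x <- 1:100
my_idx <- which((seq_along(x) - 1) %% size == rank)
local_sum <- sum(x[my_idx])

# Combine the partial sums; every process receives the global total.
total <- allreduce(local_sum, op = "sum")

comm.print(total)           # by default only rank 0 prints
finalize()
```

There is no manager handing out tasks here: each process works out its own share from its rank, and the only communication is the collective allreduce. The total should equal sum(1:100) regardless of how many processes are launched.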
In this tutorial we will:
- Motivate the need for parallelism in all stages of big data processing
- Discuss the pbdR system of packages which make R scalable
- Introduce basic MPI programming concepts
- Learn how to handle big data with parallel processing, using
"nearly native" R syntax
- Work through numerous examples, ranging from linear regression to
clustering
Goals
The tutorial aims to introduce the basics of parallel programming in
the SPMD programming model using MPI and the pbdR system of packages.
Additionally, we hope to engage package developers to instrument their
packages with pbdR so that more R analytics become scalable on large
computational platforms and to motivate our further development of
pbdR by specific user needs.
Outline
- Parallel programming and pbdR Basics
- Introduction to MPI programming in R
- Detailed pbdMPI examples
- Introduction to distributed matrices with pbdDMAT
- Detailed pbdDMAT examples
Prerequisites
We assume intermediate knowledge of R. No prior parallel programming
experience is necessary.
If you wish to follow along on your multicore laptop during the
tutorial, please install (or check that you have):
- R (and Rtools if you are a Windows user)
- An MPI library
- the pbdR packages
Please see our installation instructions for each major platform at
http://r-pbd.org/install.html.
Intended Audience
The R programmer with an interest in parallel programming and a need
to handle very large data.
Workshop Materials
Slides and source code for the tutorial will be made available by the
first week of July 2013 on the pbdR website.
Thank you for registering to participate in the "Programming with Big Data in R" tutorial. The tutorial is structured so that you can follow along "lecture style" or you can engage with the examples "hands on." Here are a few suggestions that will allow you to get the most out of this tutorial.
- We will have login tokens available for attendees to use the supercomputer Nautilus, a 1024-core SGI system at NICS. Use of this resource is optional even for hands-on purposes, but all pbdR packages are installed there, which makes following along easier. You will need an ssh client (such as PuTTY on Windows; Mac and Linux come with ssh by default) to make use of this resource. We note that this is our first tutorial use of this resource across the Atlantic. We expect things to go smoothly, but if not, your multicore laptop is the backup.
- To get the most out of the hands-on portion of the tutorial, you will need to install the latest versions of our packages. Development versions are available from r-pbd.org. However, we intend to submit the stable versions to CRAN within a few days. We expect the newest versions to be on CRAN by Monday, July 8th.
- Instructions for installing pbdR packages are available from http://r-pbd.org/install.html. Should you need assistance with installing the packages, please contact us at RBigData@gmail.com.
Related Links
r-pbd.org