Tutorial: Analysing large data with many models
A common pattern
in the analysis of large data is to perform the same analysis on many
(many many) smaller subsets of data. In this tutorial you will learn
some basic statistical and computational strategies for solving this type
of problem. During the course of the tutorial we will fit tens of
thousands of linear models to a variety of datasets and explore how we
can summarise the results to gain insight into our data.
The basic steps of
large data analysis that we will follow are:
* Identify and fit an appropriate model for a single subset of the data
* Fit the model to every subset
* Examine model fit statistics to identify subsets that don't follow the same pattern, and modify and refit the model if necessary
* Look at coefficients and other summary statistics across all subsets
* Create a single model that summarises the many smaller models
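As a concrete illustration of these steps, here is a minimal sketch using plyr with simulated data. The store/month/revenue columns and the linear-model formula are assumptions made for this illustration, not the tutorial's actual datasets:

```r
library(plyr)

# Simulated stand-in data: monthly revenue for 50 stores. Column names
# and the model are illustrative assumptions only.
set.seed(1)
sales <- expand.grid(store = sprintf("S%02d", 1:50), month = 1:24)
sales$revenue <- 100 + 2 * sales$month + rnorm(nrow(sales), sd = 10)

# 1. Identify and fit an appropriate model for a single subset
one_store <- subset(sales, store == "S01")
m1 <- lm(revenue ~ month, data = one_store)
summary(m1)

# 2. Fit the same model to every subset
models <- dlply(sales, "store", function(df) lm(revenue ~ month, data = df))

# 3. Examine fit statistics to flag subsets that don't follow the pattern
rsq <- ldply(models, function(m) c(r.squared = summary(m)$r.squared))
subset(rsq, r.squared < 0.5)

# 4. Look at coefficients and other summaries across all subsets
coefs <- ldply(models, coef)
head(coefs)
```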
You will learn how to tackle these problems using commands purely in base R, but we will also use the plyr package extensively. The plyr package provides a
toolbox for a common set of problems: you need to break a big problem
down into manageable pieces, operate on each piece and then put all
the pieces back together. It's already possible to do this with split
and the apply functions, but plyr just makes it all a bit easier with
consistent names, arguments and
outputs; input from and output to data.frames, matrices and lists; progress bars to keep track of long-running operations; and built-in error recovery.

We will touch on issues of computational efficiency, including approaches for caching your work so that in the event of an unanticipated error or machine failure you lose little time (sketched at the end of this section), but the main emphasis will be on learning the most about your data, not on fitting data in memory or similar problems.

Participants should be familiar with the basic tools of linear models in R, and should have struggled with analysing large data in the past.
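To make the caching and progress-bar ideas above concrete, here is a minimal sketch using the simulated sales data from the earlier example. The cache directory, file naming scheme and helper name are assumptions for illustration, not the tutorial's actual code:

```r
# Rough per-subset caching sketch: each store's coefficients are written
# to their own .rds file and already-processed stores are skipped, so a
# crash only costs the subsets that had not yet been fitted.
fit_with_cache <- function(df, cache_dir = "cache") {
  dir.create(cache_dir, showWarnings = FALSE)
  path <- file.path(cache_dir, paste0(df$store[1], ".rds"))
  if (file.exists(path)) return(readRDS(path))
  res <- coef(lm(revenue ~ month, data = df))
  saveRDS(res, path)
  res
}

# ddply with a text progress bar; rerunning after a failure reuses the cache
results <- ddply(sales, "store", fit_with_cache, .progress = "text")
```

The same per-subset loop could be written with split() and lapply() in base R; ddply() adds the consistent output format and the progress bar described above.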