Tutorial: Regression on large data sets: big(g)lm and
other approaches
|
Everyone knows that R can handle only small data sets. This
tutorial will look at ways to show that 'everyone' is wrong.
There are three main approaches. For data sets with up to a few
hundred thousand rows it is possible to perform the regressions
in R provided only the necessary variables are loaded. For larger
data sets we can use incremental updates of bounded-memory computations,
as in the biglm package, or perform the large-data computations
directly in a database.
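As a minimal sketch of the bounded-memory approach, assuming the biglm
package and a large comma-separated file big.csv with columns y, x1 and
x2 (the file and variable names are hypothetical): the data are read in
chunks and the fit is updated incrementally, so memory use stays bounded
however many rows the file has.

    library(biglm)

    con <- file("big.csv", open = "r")
    vars <- strsplit(readLines(con, n = 1), ",")[[1]]  # column names from the header line

    ## the first chunk initialises the fit
    chunk <- read.csv(con, nrows = 10000, header = FALSE, col.names = vars)
    fit <- biglm(y ~ x1 + x2, data = chunk)

    ## remaining chunks update it; read.csv() errors at end of file,
    ## which tryCatch() turns into NULL
    repeat {
      chunk <- tryCatch(read.csv(con, nrows = 10000, header = FALSE,
                                 col.names = vars),
                        error = function(e) NULL)
      if (is.null(chunk)) break
      fit <- update(fit, chunk)   # bounded-memory incremental update
    }
    close(con)
    summary(fit)

Because only one chunk is in memory at a time, the same code works for
files much larger than RAM.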
|
1) Why does lm() use a lot of memory?
2) Data examples
3) A little SQL: Load-on-demand regression
4) Bounded-memory algorithms
5) One pass: biglm
6) Iterative: bigglm
7) More SQL: pushing computations to the database.
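As a hedged illustration of the SQL and bigglm steps in the outline
above, the sketch below keeps the data in an SQLite table and feeds
them to bigglm() through a chunk-reading callback; bigglm() calls the
function repeatedly, with reset = TRUE between the passes of the
iterative fit. The database file mydata.db, the table trips, and the
variables y, x1 and x2 are hypothetical, and y is assumed to be binary.

    library(DBI)
    library(RSQLite)
    library(biglm)

    db <- dbConnect(SQLite(), "mydata.db")

    make.data <- function(chunksize = 10000) {
      offset <- 0
      function(reset = FALSE) {
        if (reset) {
          offset <<- 0            # bigglm() rewinds between passes
          return(NULL)
        }
        ## note: LIMIT/OFFSET paging assumes a stable row order
        chunk <- dbGetQuery(db, sprintf(
          "SELECT y, x1, x2 FROM trips LIMIT %d OFFSET %d",
          chunksize, offset))
        offset <<- offset + chunksize
        if (nrow(chunk) == 0) NULL else chunk   # NULL signals end of data
      }
    }

    fit <- bigglm(y ~ x1 + x2, data = make.data(), family = binomial())
    summary(fit)
    dbDisconnect(db)

The biglm package also provides bigglm methods that accept a database
connection directly (see the package documentation); the callback form
shown here makes the chunking explicit.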
The tutorial is aimed at users of R who want to analyse data sets that
do not fit conveniently into memory. The focus will be on linear and
generalized linear models, but the techniques are relevant to other
computations.
|