June 27 - June 30 2016
Stanford University, Stanford, California
The materials used in the tutorial are available here.
The ability to easily collect large amounts of data from different sources is an opportunity to better understand many processes, and it has already led to breakthroughs in several application areas. However, because measurements and objectives are so heterogeneous, these large databases often exhibit an extraordinarily high number of missing values. Hence, in addition to scientific questions, such data also present important methodological and technical challenges for data analysts.
The aim of this tutorial is to present an overview of the missing values literature as well as the recent improvements that have caught the attention of the community because of their ability to handle large matrices with a large number of missing entries.
We will touch upon single imputation, with a focus on matrix completion methods based on iterative regularized SVD; notions of confidence intervals, by giving the fundamentals of multiple imputation strategies; and issues of visualization with incomplete data. The approaches will be illustrated on real data with continuous, binary and categorical variables, using some of the main R packages: Amelia, mice, missForest, missMDA, norm, softImpute and VIM.
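To give a flavour of what matrix completion by iterative regularized SVD involves, here is a minimal base R sketch. It is an illustration under stated assumptions, not the implementation used in missMDA or softImpute: the function name `impute_svd` and its arguments `rank`, `lambda`, `max_iter` and `tol` are hypothetical, and the regularization shown is plain soft-thresholding of the leading singular values.

```r
# Hypothetical helper (illustration only): alternate between a low-rank SVD fit
# and re-imputation of the missing cells until the imputed values stabilise.
impute_svd <- function(X, rank = 2, lambda = 0.1, max_iter = 200, tol = 1e-6) {
  X <- as.matrix(X)
  miss <- is.na(X)
  # Initialise the missing cells with the observed column means.
  col_means <- colMeans(X, na.rm = TRUE)
  filled <- X
  filled[miss] <- col_means[col(X)[miss]]
  for (iter in seq_len(max_iter)) {
    # Centre the currently completed matrix.
    mu <- colMeans(filled)
    centred <- sweep(filled, 2, mu)
    # Keep the leading `rank` dimensions and soft-threshold their singular values.
    s <- svd(centred, nu = rank, nv = rank)
    d <- pmax(s$d[seq_len(rank)] - lambda, 0)
    fitted <- sweep(s$u %*% (d * t(s$v)), 2, mu, "+")
    # Observed cells are never modified; only missing cells are updated.
    new_filled <- X
    new_filled[miss] <- fitted[miss]
    change <- sum((new_filled[miss] - filled[miss])^2)
    filled <- new_filled
    if (change < tol) break
  }
  filled
}

# Usage on simulated data with an approximate rank-2 structure (assumed example).
set.seed(1)
n <- 50; p <- 6
full <- matrix(rnorm(n * 2), n, 2) %*% matrix(rnorm(2 * p), 2, p) +
  matrix(rnorm(n * p, sd = 0.1), n, p)
incomplete <- full
incomplete[sample(n * p, 30)] <- NA
completed <- impute_svd(incomplete, rank = 2, lambda = 0.1)
```

In practice the number of dimensions and the amount of shrinkage would be chosen from the data, for example by cross-validation, rather than fixed by hand as in this sketch.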
Through computer practicals using several R packages, participants will learn how to apply the statistical methods introduced in the course to realistic datasets from different fields (biomedical, social science, etc.).
Elementary knowledge of general statistical concepts and (linear) statistical models is assumed. Basic knowledge of singular value decomposition and principal component analysis could be useful.
Please make sure to download the R packages missMDA, FactoMineR, VIM, missForest, Amelia and mice.
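One possible way to fetch these six packages from an R session is sketched below. It assumes a working internet connection and a configured CRAN mirror; the variable names `pkgs` and `missing_pkgs` are only illustrative.

```r
# Packages named in the tutorial prerequisites.
pkgs <- c("missMDA", "FactoMineR", "VIM", "missForest", "Amelia", "mice")

# Find the ones that are not installed yet (requireNamespace returns TRUE/FALSE).
missing_pkgs <- pkgs[!vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)]

# Install only what is missing; this step needs network access to CRAN.
if (length(missing_pkgs) > 0) install.packages(missing_pkgs)
```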
missMDA: