Tutorial: An introduction to data cleaning with R
Mark van der Loo, Statistics Netherlands - m.vanderloo@cbs.nl
Edwin de Jonge, Statistics Netherlands - e.dejonge@cbs.nl
Overview
Raw statistical data is rarely of sufficient quality to allow for immediate statistical analyses. In
many cases, one has to take care of invalid or missing values in a dataset before one can start
analysing the data. Such a process becomes even more complex when variables in a record set
are interrelated by consistency rules. For example, in a survey where one asks for gender
(male, female) and pregnant (yes, no), the combination (gender=male, pregnant=yes) is invalid
for obvious reasons.
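As a minimal sketch of how such a consistency rule might be checked, the base R snippet below flags records that combine gender=male with pregnant=yes. The data frame `survey` and its values are hypothetical and serve only to illustrate the idea; the snippet is not taken from the tutorial materials.

```r
# Hypothetical survey records; column names and values are illustrative only.
survey <- data.frame(
  gender   = c("male", "female", "male", "female"),
  pregnant = c("no", "yes", "yes", "no"),
  stringsAsFactors = FALSE
)

# A record violates the rule when gender is male and pregnant is yes.
violates <- survey$gender == "male" & survey$pregnant == "yes"

# Row numbers of the inconsistent records.
which(violates)
```

Such a check only detects that a record is inconsistent. It does not tell which of the two fields is wrong, which is why error localization is treated as a separate step.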
Goals
Outline
This tutorial will cover a number of tools and methods that allow for automated and reproducible
data cleaning. Topics discussed include
- General introduction to reproducible data cleaning strategies
- Detection and reporting of missing values
- Checking data against predefined rules, such as numeric ranges, sum rules and
restrictions on categorical data
- Visualisation of data quality
- Localizing erroneous fields in a dataset
- Adjusting erroneous fields and imputing missing values
Some, but not all, elements of the tutorial will make use of R packages to which the authors
have contributed, most notably tabplot, editrules, deducorrect and rspa.
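To give a flavour of the first topics listed above, the following sketch counts missing values per variable and checks a range rule and a sum rule. It deliberately uses base R only rather than the packages named above, and the data frame `dat`, its variables and both rules are assumptions made up for illustration.

```r
# Hypothetical data; variable names, values and rules are illustrative assumptions.
dat <- data.frame(
  age      = c(34, NA, 150, 27),
  turnover = c(100, 250, NA, 80),
  costs    = c(60, 200, 30, 90),
  profit   = c(40, 50, 10, 10)
)

# Detection and reporting of missing values: count NAs per variable.
missing_per_variable <- colSums(is.na(dat))

# Range rule: age must lie between 0 and 120.
age_ok <- dat$age >= 0 & dat$age <= 120

# Sum rule: turnover must equal costs plus profit.
balance_ok <- dat$turnover == dat$costs + dat$profit

# TRUE marks a violated rule; NA means the rule could not be evaluated
# because at least one of the values involved is missing.
violations <- data.frame(age_range = !age_ok, balance = !balance_ok)

missing_per_variable
violations
```

Keeping the rules as explicit, scripted expressions is what makes the cleaning step reproducible: the same checks can be rerun unchanged whenever the raw data are updated.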
Prerequisites
The tutorial will include practical exercises, so bring your laptop!
Intended Audience
Workshop Materials
Related Links