 
 
    
      Tutorial: An introduction to data cleaning with R
    
    
    
    Mark van der Loo, Statistics Netherlands - m.vanderloo@cbs.nl
    Edwin de Jonge, Statistics Netherlands - e.dejonge@cbs.nl
     
    
      Overview
    
    
Raw statistical data is rarely of sufficient quality to allow for immediate statistical analysis. In
many cases, invalid or missing values in a dataset have to be dealt with before the analysis can
start. The process becomes more complex when the variables in a record set are interrelated by
consistency rules. For example, in a survey that asks for gender (male, female) and pregnant
(yes, no), the combination (gender=male, pregnant=yes) is invalid for obvious reasons.
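As a minimal sketch of such a consistency rule, the check below flags records that combine gender=male with pregnant=yes. It uses base R only; the data frame `survey`, its values and the names `violates` and `invalid_rows` are hypothetical and serve only as illustration.

```r
# Toy survey data (hypothetical values, for illustration only)
survey <- data.frame(
  gender   = c("male", "female", "male", "male"),
  pregnant = c("no", "yes", "yes", NA),
  stringsAsFactors = FALSE
)

# Rule: a record may not combine gender == "male" with pregnant == "yes".
# For record 4 the result is NA: gender is "male" but pregnant is missing,
# so the rule can be neither confirmed nor rejected.
violates <- survey$gender == "male" & survey$pregnant == "yes"

# Row numbers of records that definitely break the rule
# (which() silently skips the NA entries)
invalid_rows <- which(violates)
print(invalid_rows)
```

Records for which the check evaluates to NA are undecided rather than valid, which is one reason why missing values and rule violations are usually treated together.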
    
      Goals
    
    
      Outline
    
This tutorial will cover a number of tools and methods that allow for automated and reproducible
data cleaning. Topics discussed include:
- General introduction to reproducible data cleaning strategies
- Detection and reporting of missing values
- Checking data against predefined rules, such as numeric ranges, sum rules and
restrictions on categorical data
- Visualisation of data quality
- Localising erroneous fields in a dataset
- Adjusting erroneous fields and imputing missing values
Some, but not all, elements of the tutorial will make use of R packages to which the authors
have contributed, most notably tabplot, editrules, deducorrect and rspa.
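To give a flavour of the topics listed above, the following sketch walks through missing-value reporting, a range check, and a simple imputation step. It deliberately uses base R only and does not show the API of the packages named above. The data frame `people`, the assumed valid age range of 0 to 120, and the choice of mean imputation are illustrative assumptions, not the methods taught in the tutorial.

```r
# Hypothetical records (illustration only)
people <- data.frame(
  age    = c(34, NA, 151, 27),
  income = c(2100, 1800, NA, NA)
)

# Detection and reporting of missing values: number of NAs per variable
missing_per_variable <- colSums(is.na(people))
print(missing_per_variable)

# Checking data against a predefined numeric range (assumed: 0 <= age <= 120)
age_in_range <- people$age >= 0 & people$age <= 120

# Localising and adjusting: treat out-of-range ages as missing
people$age[which(!age_in_range)] <- NA

# Imputing missing values with the observed mean of each variable
# (mean imputation is chosen here only for brevity)
for (v in names(people)) {
  people[[v]][is.na(people[[v]])] <- mean(people[[v]], na.rm = TRUE)
}
print(people)
```

Keeping every step in a script like this, rather than editing values by hand, is what makes the cleaning process reproducible: rerunning the script on the raw data yields the same cleaned dataset.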
      Prerequisites
    
The tutorial will include practical exercises, so bring your laptop!
    
      Intended Audience
    
    
      Workshop Materials
    
    
      Related Links