Tutorial: An introduction to data cleaning with R

Mark van der Loo, Statistics Netherlands - m.vanderloo@cbs.nl
Edwin de Jonge, Statistics Netherlands - e.dejonge@cbs.nl

Overview

Raw statistical data is rarely of sufficient quality to allow for immediate statistical analyses. In many cases, one has to take care of invalid or missing values in a dataset before one can start analysing the data. Such a process becomes even more complex when variables in a record set are interrelated by consistency rules. For example, in a survey where one asks for gender (male, female) and pregnant (yes, no), the combination (gender=male, pregnant=yes) is invalid for obvious reasons.
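
The consistency rule in the example above can be written as a logical check in base R. The following is a minimal sketch: the data frame `survey`, its column names and its values are invented for illustration and are not taken from the tutorial materials.

```r
# Hypothetical survey records; all values are invented for illustration.
survey <- data.frame(
  gender   = c("male", "female", "male", "female"),
  pregnant = c("no", "yes", "yes", NA),
  stringsAsFactors = FALSE
)

# A record violates the rule when gender is male and pregnant is yes.
# %in% treats NA as a non-match, so records with missing values are
# not flagged as violations by this check.
violates <- survey$gender %in% "male" & survey$pregnant %in% "yes"

# Row numbers of the inconsistent records
which(violates)
```

Here a missing value is not counted as a violation. Whether missing values should be reported separately is a design choice that depends on the cleaning strategy.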

Goals

Outline

This tutorial will cover a number of tools and methods that allow for automated and reproducible data cleaning. Topics discussed include:

General introduction to reproducible data cleaning strategies
Detection and reporting of missing values
Checking data against predefined rules, such as numeric ranges, sum rules and restrictions on categorical data
Visualisation of data quality
Localising erroneous fields in a dataset
Adjusting erroneous fields and imputing missing values

Some, but not all, elements of the tutorial will make use of R packages to which the authors have contributed, most notably tabplot, editrules, deducorrect and rspa.
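
Several of the topics in the outline can be tried in base R alone, without any additional packages. The sketch below assumes a small made-up data frame `people`; the variable names, the age range of 0 to 120 and the use of the median for imputation are illustrative assumptions, not the methods taught in the tutorial.

```r
# Hypothetical records; names and values are invented for illustration.
people <- data.frame(
  age    = c(34, NA, 151, 27),
  income = c(2100, 1800, NA, NA)
)

# Detection and reporting of missing values: count NAs per variable.
missing_per_column <- colSums(is.na(people))

# Range rule: age must lie in [0, 120]. Comparisons with NA yield NA,
# so a missing age shows up as NA (unknown), not as a violation.
age_ok <- people$age >= 0 & people$age <= 120

# Simple imputation: replace missing income by the observed median.
imputed <- people
imputed$income[is.na(imputed$income)] <- median(people$income, na.rm = TRUE)
```

Median imputation is used here only because it is short to write. It ignores relations between variables, so an imputed value may still violate a consistency rule.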

Prerequisites

The tutorial will include practical exercises, so bring your laptop!
