useR! 2010: Exploratory Data Analysis with a special focus on clustering and multiway methods

Tutorial: Exploratory data analysis with a special focus on clustering and multiway methods

François Husson, Agrocampus Rennes, Rennes Centre for Higher Education and Research in Agronomy, Rennes Cedex, France
Julie Josse, Agrocampus Rennes, Rennes Centre for Higher Education and Research in Agronomy, Rennes Cedex, France

Motivation

Researchers increasingly have to handle complex data, hence the need to summarize them and to visualize the information in a clear and convenient way. This course first presents classical data mining methods for exploring and visualizing data, such as Principal Components Analysis and Multiple Correspondence Analysis. We will then focus on Multiple Factor Analysis, which takes into account data sets structured in groups of variables. We will see how these methods can handle heterogeneous data (continuous and categorical). Finally, we will present clustering methods (hierarchical clustering and k-means algorithms), with emphasis on the complementarity between clustering and exploratory methods. We illustrate these methods with data sets from fields such as genomics (mouse and human tumor data), ecology, and sensometrics (wine data).

Outline

Topics will include:

1) Multivariate exploratory data analysis (PCA, MCA)

a. Introduction: main objectives of these methods (reducing the dimensionality of the data sets, summarizing the information, building typologies of individuals and variables); types of variables used (continuous, categorical, or both)
b. Principal Components Analysis: simultaneous interpretation of the graphs of individuals and variables; how and when to introduce supplementary information (supplementary individuals, supplementary continuous or categorical variables). Supplementary information is not used to construct the analysis but helps to interpret it. We will also discuss how to describe the dimensions quickly
c. Exploratory analysis of categorical data (such as survey data) with Multiple Correspondence Analysis

2) Multiple Factor Analysis (MFA)

Data sets in which individuals are described by several groups of variables are common. For example, in genomics, it is interesting to combine DNA and protein data; in sensory analysis, products are described by both sensory attributes and physico-chemical variables; in survey analysis, items are frequently organized into themes. It is therefore useful to keep track of this structure in the statistical analysis and to use it to enrich the results. Multiple Factor Analysis is an exploratory method (a kind of extension of PCA) that takes into account the structure of the variables into groups.

a. Examples of data sets involved in MFA and main objectives
b. MFA: a weighted PCA that balances the influence of the groups of variables in the analysis
c. Interpretation of the results (graphs and outputs) obtained by MFA
  - classical (as in PCA or MCA): typology of individuals and variables
  - specific: the groups of variables graph (to visualize the relationship between sets of variables, to compare the information represented by each set of variables, and to ascertain whether the relative positions of individuals in one group are similar to those in another group); the "partial representation" (to visualize the representation of the individuals obtained from each set of variables, as in Procrustes analysis); the "partial axes" graph (to compare the results of the PCA or MCA performed on each set of variables)
d. How to introduce supplementary information such as supplementary groups of variables
e. To go further: how to take into account a hierarchy on the variables with Hierarchical MFA
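
The balancing idea in item b can be sketched in a few lines. In the usual formulation of MFA, each group of variables is divided by the square root of the first eigenvalue of its own separate PCA, so that no single group can dominate the first dimension of the global analysis. The sketch below is only an illustration of that weighting step in plain Python: the function names, the group names and the toy numbers are all made up for this example, and it is not the implementation used by FactoMineR.

```python
import math

def standardize(col):
    """Center one variable (a list of floats) and scale it to unit variance."""
    mean = sum(col) / len(col)
    centered = [x - mean for x in col]
    sd = math.sqrt(sum(x * x for x in centered) / len(centered))
    return [x / sd for x in centered]

def first_eigenvalue(cols, iters=500):
    """Largest eigenvalue of the covariance matrix of centered columns,
    estimated by power iteration (assumes a clear dominant eigenvalue)."""
    n, p = len(cols[0]), len(cols)
    cov = [[sum(a * b for a, b in zip(cols[i], cols[j])) / n
            for j in range(p)] for i in range(p)]
    v = [1.0 / (j + 1) for j in range(p)]  # arbitrary non-zero start vector
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient of the unit vector reached by the iteration
    return sum(v[i] * sum(cov[i][j] * v[j] for j in range(p)) for i in range(p))

def mfa_weight(groups):
    """Standardize every variable, then divide each group by the square
    root of the first eigenvalue of that group's own PCA."""
    weighted = {}
    for name, cols in groups.items():
        z = [standardize(c) for c in cols]
        lam = first_eigenvalue(z)
        weighted[name] = [[x / math.sqrt(lam) for x in c] for c in z]
    return weighted

# Hypothetical toy data: 6 individuals, two groups of variables
# (one list per variable). The values are invented for illustration.
toy_groups = {
    "sensory": [[1, 2, 3, 4, 5, 6], [2, 1, 4, 3, 6, 5], [1, 3, 2, 5, 4, 6]],
    "chemistry": [[3, 1, 4, 1, 5, 9], [2, 1, 3, 2, 4, 7]],
}
weighted_groups = mfa_weight(toy_groups)
```

After this step, the first eigenvalue of each weighted group equals 1, and the global analysis is an ordinary PCA on the concatenation of the weighted groups. Without the weighting, a group with more (or more strongly correlated) variables would have a larger first eigenvalue and would pull the first global dimension toward itself.
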

3) Clustering

In this section we will present an approach to unsupervised clustering that combines exploratory methods (PCA, MCA, MFA) with clustering methods. Exploratory methods produce principal components, which are continuous variables that best summarize all the variables in the data at hand (continuous, categorical or heterogeneous). Clustering methods can therefore be applied to these components, with the exploratory method serving as a pre-processing step. Exploratory methods also give us more: the factorial maps and the other results they provide enrich the interpretation and the description of the clusters. We will pay particular attention to the description of the clusters.

a. Introduction: presentation of hierarchical clustering and k-means methods
b. Using PCA, MCA and MFA as a preprocessing step for clustering
c. Choosing the number of clusters
d. Representation of the clusters on the factorial map, to combine the information provided by exploratory methods and clustering methods
e. Automatic characterization of the clusters with continuous and categorical variables
f. To go further: Hierarchical clustering on big data sets (with a large number of individuals and of variables)
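
The idea of clustering on principal components can be illustrated with a small, self-contained sketch: standardize the variables, keep the first few principal components, and run k-means on the component scores. Everything here is an assumption made for illustration (the helper names, the toy data, the choice of two components and two clusters, and the farthest-first initialization); it uses k-means only and is not the procedure implemented in FactoMineR.

```python
import math

def standardize(rows):
    """Center and scale each variable (column) of a list of rows."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    sds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n)
           for j in range(p)]
    return [[(r[j] - means[j]) / sds[j] for j in range(p)] for r in rows]

def principal_axes(rows, ncp, iters=500):
    """First `ncp` eigenvectors of the correlation matrix of standardized
    rows, found by power iteration with deflation."""
    n, p = len(rows), len(rows[0])
    c = [[sum(r[i] * r[j] for r in rows) / n for j in range(p)]
         for i in range(p)]
    axes = []
    for _ in range(ncp):
        v = [1.0 / (j + 1) for j in range(p)]  # arbitrary non-zero start
        for _ in range(iters):
            w = [sum(c[i][j] * v[j] for j in range(p)) for i in range(p)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        lam = sum(v[i] * sum(c[i][j] * v[j] for j in range(p))
                  for i in range(p))
        axes.append(v)
        # Deflation: remove the component just found from the matrix
        c = [[c[i][j] - lam * v[i] * v[j] for j in range(p)]
             for i in range(p)]
    return axes

def scores(rows, axes):
    """Coordinates of the individuals on the principal axes."""
    return [[sum(x * a for x, a in zip(r, axis)) for axis in axes]
            for r in rows]

def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=100):
    """Lloyd's algorithm with a deterministic farthest-first initialization."""
    centers = [list(points[0])]
    while len(centers) < k:
        far = max(points, key=lambda pt: min(sqdist(pt, c) for c in centers))
        centers.append(list(far))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: sqdist(pt, centers[c]))
                  for pt in points]
        new_centers = []
        for c in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            # Keep the old center if a cluster happens to be empty
            new_centers.append([sum(col) / len(members) for col in zip(*members)]
                               if members else centers[c])
        if new_centers == centers:
            break
        centers = new_centers
    return labels

# Hypothetical toy data: 6 individuals described by 3 continuous variables.
toy_rows = [[1.0, 2.0, 1.5], [1.2, 1.8, 1.4], [0.9, 2.1, 1.6],
            [5.0, 6.0, 5.5], [5.2, 5.9, 5.4], [4.8, 6.2, 5.7]]
z = standardize(toy_rows)
toy_scores = scores(z, principal_axes(z, ncp=2))
toy_labels = kmeans(toy_scores, k=2)
```

Because the clusters are computed from the same component scores that are plotted on the factorial map, each individual's cluster label can be displayed directly on that map, which is the combination described in item d.
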

The different methods will be illustrated with numerous examples and we will use one or more packages such as FactoMineR.

Intended audience

Researchers in applied fields, teachers of data mining and data analysis, and statisticians who are interested in multivariate analysis and multiway analysis.

Background knowledge

No prior knowledge is required, although basic knowledge of PCA is helpful.

Tutorial Materials

Slides are here.