Tutorial: Exploratory data analysis with a special focus on clustering and multiway methods
François Husson, Agrocampus Rennes, Rennes Centre for Higher Education and Research in Agronomy, Rennes Cedex, France
Julie Josse, Agrocampus Rennes, Rennes Centre for Higher Education and Research in Agronomy, Rennes Cedex, France
Motivation
Researchers increasingly have to handle complex data, hence the need to
summarize them and to visualize the information in a clear and convenient
way.
This course first presents some classical data mining methods for
exploring and visualizing data, such as Principal Components Analysis and
Multiple Correspondence Analysis. Then we will focus on Multiple Factor
Analysis, which takes into account data sets structured in
groups of variables. We will see how these methods can handle
heterogeneous data (continuous and categorical). Finally, we will
present methods of clustering (hierarchical clustering and k-means
algorithms), with emphasis on the complementarity between
clustering and exploratory methods.
We illustrate these different methods through data sets with origins in
fields such as genomics (mouse and human tumor data), ecology,
and sensometrics (wine data).
Outline
Topics will include:
- 1) Multivariate exploratory data analysis (PCA, MCA)
- a. Introduction: main objectives of these methods (reducing the
dimensionality of the data sets, summarizing the information, building
typologies of individuals and variables); types of variables used
(continuous, categorical, or both)
- b. Principal Components Analysis: simultaneous interpretation of the
graphs for individuals and variables; how to introduce
supplementary information such as supplementary individuals, and
when to introduce supplementary continuous or categorical variables.
Supplementary information is not involved in the construction of the
analysis but is a tool to facilitate its interpretation. How to
quickly describe the dimensions will also be discussed
- c. Exploratory analysis of categorical data (such as survey data) with
Multiple Correspondence Analysis
- 2) Multiple Factor Analysis (MFA)
Datasets in which individuals
are described by several groups of variables are often modeled.
For example, in genomics
data analysis, it is interesting to combine DNA and protein data; in
sensory analysis, products are described by both sensory attributes and
physico-chemical variables; in survey analysis, items are frequently
structured in themes. It is thus useful in the statistical analysis to
keep track of this structure and to use it to enrich the analysis.
Multiple Factor Analysis is an exploratory data analysis method (a
kind of extension of PCA) which takes into account the grouping of the
variables into sets.
- a. Examples of data sets involved in MFA and main objectives
- b. MFA: a weighted PCA that balances the influence of the
groups of variables in the analysis
- c. Interpretation of the results (graphs and outputs) obtained by MFA
- classical (as in PCA or MCA): typology of individuals and variables
- specifics: the groups of variables graph (to visualize the
relationship between sets of variables; to compare the information represented
by each set of variables; and to ascertain whether the relative positions
of individuals in
one group are similar to those in another group); ``the partial
representation'' (to visualize the individual representation obtained by
each set of variables, as in Procrustes analysis); ``the partial axes''
graph (to compare the results of PCA or MCA performed on each set of
variables)
- d. How to introduce supplementary information such as supplementary
groups of variables
- e. To go further: how to take into account a hierarchy on the variables:
Hierarchical MFA
- 3) Clustering
In this section we will present an approach to unsupervised clustering
which combines exploratory methods (PCA, MCA, MFA) with
clustering methods. This approach is based on the fact that exploratory
methods yield principal components, which are continuous
variables that best summarize all the variables in the
(continuous, categorical or heterogeneous) data at hand. Consequently,
clustering methods are often applied to these components (exploratory
methods are used as a pre-processing step). Using exploratory methods
gives us more: we can benefit from the factorial maps and all the
results provided by exploratory methods to enrich the interpretation
and the description of the clusters. We will pay a lot of attention to
the description of the clusters.
- a. Introduction: presentation of hierarchical clustering and k-means methods
- b. Clustering: using PCA, MCA and MFA as a preprocessing step for
clustering
- c. Choosing the number of clusters
- d. Representation of the clustering on the factorial map to combine
information provided by exploratory methods and clustering methods
- e. Automatic characterization of the clusters with continuous and
categorical variables
- f. To go further: Hierarchical clustering on big data sets (with a large
number of individuals and of variables)
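As a rough illustration of the approach outlined in part 3 (exploratory method first, clustering on the resulting components second), here is a minimal Python sketch. It is not the FactoMineR implementation and not the code used in the tutorial: the toy data are invented, all function names are hypothetical, the principal components are obtained by a simple power iteration with deflation rather than a full eigendecomposition, and k-means is used with a deterministic farthest-point initialization chosen only to keep the example reproducible.

```python
import math

def standardize(rows):
    """Centre each column and scale it to unit variance (assumes no constant column)."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    sds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n) for j in range(p)]
    return [[(r[j] - means[j]) / sds[j] for j in range(p)] for r in rows]

def covariance(rows):
    """Covariance matrix of rows that are already centred."""
    n, p = len(rows), len(rows[0])
    return [[sum(r[a] * r[b] for r in rows) / n for b in range(p)] for a in range(p)]

def leading_eigenpair(matrix, iterations=1000):
    """Power iteration: largest eigenvalue and unit eigenvector of a symmetric PSD matrix."""
    n = len(matrix)
    vec = [1.0 / (i + 1) for i in range(n)]  # arbitrary non-uniform starting vector
    value = 0.0
    for _ in range(iterations):
        product = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        value = math.sqrt(sum(x * x for x in product))
        if value == 0.0:
            break
        vec = [x / value for x in product]
    return value, vec

def principal_components(rows, n_components):
    """Coordinates of the individuals on the first principal axes (standardized PCA)."""
    z = standardize(rows)
    cov = covariance(z)
    p = len(cov)
    axes = []
    for _ in range(n_components):
        value, vec = leading_eigenpair(cov)
        axes.append(vec)
        # Deflation: remove the axis just found so the next one can be extracted.
        cov = [[cov[i][j] - value * vec[i] * vec[j] for j in range(p)] for i in range(p)]
    return [[sum(r[j] * axis[j] for j in range(p)) for axis in axes] for r in z]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=100):
    """Plain k-means; centres start at the first point, then at successive farthest points."""
    centers = [list(points[0])]
    while len(centers) < k:
        farthest = max(points, key=lambda q: min(squared_distance(q, c) for c in centers))
        centers.append(list(farthest))
    labels = [0] * len(points)
    for _ in range(iterations):
        labels = [min(range(k), key=lambda c: squared_distance(pt, centers[c]))
                  for pt in points]
        new_centers = []
        for c in range(k):
            members = [pt for pt, lab in zip(points, labels) if lab == c]
            if members:
                new_centers.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centers.append(centers[c])  # keep an empty cluster's centre unchanged
        if new_centers == centers:
            break
        centers = new_centers
    return labels

# Invented toy data: 8 individuals described by 3 continuous variables.
rows = [
    [1.0, 2.0, 10.0], [1.2, 1.8, 9.5], [0.8, 2.2, 10.5], [1.1, 2.1, 9.8],
    [5.0, 6.0, 2.0], [5.2, 5.8, 2.5], [4.8, 6.2, 1.5], [5.1, 6.1, 2.2],
]
scores = principal_components(rows, n_components=2)  # pre-processing step
labels = kmeans(scores, k=2)                         # clustering on the components
print(labels)
```

Because the clustering is run on the component scores, the same coordinates can be reused to draw the individuals on the factorial map and colour them by cluster, which is the combination item d refers to.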
The different methods will be illustrated with numerous examples and we
will use one or more packages such as FactoMineR.
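The weighting idea behind item 2b (MFA as a weighted PCA that balances the groups of variables) can be sketched in a few lines of Python. The usual choice in MFA is to divide each standardized variable of a group by the square root of the first eigenvalue of that group's separate PCA, so that no group can dominate the first axis of the global analysis. The sketch below only illustrates this weighting step under stated assumptions: the data are invented, the variables are all continuous, the function names are hypothetical, and the first eigenvalue is approximated by power iteration. It is not the FactoMineR code.

```python
import math

def standardize(rows):
    """Centre each column and scale it to unit variance (assumes no constant column)."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    sds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / n) for j in range(p)]
    return [[(r[j] - means[j]) / sds[j] for j in range(p)] for r in rows]

def covariance(rows):
    """Covariance matrix of rows that are already centred."""
    n, p = len(rows), len(rows[0])
    return [[sum(r[a] * r[b] for r in rows) / n for b in range(p)] for a in range(p)]

def first_eigenvalue(matrix, iterations=1000):
    """Largest eigenvalue of a symmetric PSD matrix, approximated by power iteration."""
    n = len(matrix)
    vec = [1.0 / (i + 1) for i in range(n)]  # arbitrary non-uniform starting vector
    value = 0.0
    for _ in range(iterations):
        product = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        value = math.sqrt(sum(x * x for x in product))
        if value == 0.0:
            break
        vec = [x / value for x in product]
    return value

def group_first_eigenvalue(centred_rows, cols):
    """First eigenvalue of the PCA restricted to the columns of one group."""
    sub = [[r[j] for j in cols] for r in centred_rows]
    return first_eigenvalue(covariance(sub))

def mfa_weight(rows, groups):
    """Standardize, then divide each group's columns by sqrt(first eigenvalue of the group)."""
    z = standardize(rows)
    weighted = [list(r) for r in z]
    for cols in groups:
        lam = group_first_eigenvalue(z, cols)
        for r in weighted:
            for j in cols:
                r[j] /= math.sqrt(lam)
    return weighted

# Invented toy data: 6 individuals, a group of 3 strongly correlated variables
# (columns 0-2) and a group of 2 weakly correlated variables (columns 3-4).
rows = [
    [1.0, 1.1, 0.9, 5.0, 2.0],
    [2.0, 2.1, 1.8, 3.0, 7.0],
    [3.0, 2.9, 3.2, 6.0, 1.0],
    [4.0, 4.2, 3.9, 2.0, 4.0],
    [5.0, 4.8, 5.1, 4.0, 6.0],
    [6.0, 6.1, 6.0, 1.0, 3.0],
]
groups = [[0, 1, 2], [3, 4]]

weighted = mfa_weight(rows, groups)
for cols in groups:
    before = group_first_eigenvalue(standardize(rows), cols)
    after = group_first_eigenvalue(weighted, cols)
    print(cols, "first eigenvalue before:", before, "after:", after)
```

After weighting, the first eigenvalue of every group equals 1 up to numerical error, so a global PCA on the weighted table cannot have its first axis driven by a single group simply because that group contains many correlated variables.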
Intended audience
Researchers in applied fields, teachers in data mining and data analysis,
and statisticians who are interested in multivariate analysis and multiway
analysis
Background knowledge
No prior knowledge is required, although basic knowledge of PCA is welcome.
Related link
More information (notes, scripts, and data sets) will be available on
our website
http://factominer.free.fr.
Tutorial Materials
Slides are here.