Tutorial: Distributed Data Analysis using R
Stefan Rüping, Fraunhofer IAIS, Sankt Augustin, Germany.
Michael Mock, Fraunhofer IAIS, Sankt Augustin, Germany.
Dennis Wegener, Fraunhofer IAIS, Sankt Augustin, Germany.
Abstract
In the last couple of years, the amount of data to be analysed in different areas has grown rapidly.
Examples range from natural sciences (e.g. astronomy or particle physics) and business data (e.g.
a sharp increase in data volume is expected from the use of RFID technology) to life sciences
(such as high-throughput genomics and post-genomics technologies) and data generated by
normal users on the internet (see Google, YouTube, etc.). The enormous growth of the amount
of data is complemented by advances in distributed computing technology enabling the data
analyst to handle this amount of data in reasonable time. Two main streams of current
distributed technology development and research are particularly useful in this respect. The first,
GRID technology, aims at making geographically widely spread data stores and computing
facilities available for a common, global data analysis. The other stream
of development is cluster-based computing, which turns large numbers of standard
computers into high-performance computing platforms.
However, although the above-mentioned advances in distributed computing technology provide
the computing and storage resources for handling large amounts of data, they
introduce another level of complexity into the system. The traditional data analyst, with
a strong background in statistics and application domain knowledge, might be overwhelmed
by the complexity of the underlying distributed technology. For instance, an application
developer using R might not be interested in any details of how web services are built.
Therefore, ongoing research aims at bridging the gap between advanced distributed
computing technology and traditional statistical software.
The goal of this tutorial is to inform statisticians, especially those using the R language,
about current trends in distributed computing technology and to show how to use and
integrate R programs in distributed environments, considering both GRID and cluster-based
computing. As a particularly challenging example, we will report, among others, on the
Advancing Clinico-Genomics Trials on Cancer project (ACGT), which aims at providing a
data analysis environment that allows the exploitation of an enormous pool of data collected
in European cancer treatments.
In the context of this project, the GridR package was developed, which was one of the first
attempts to connect R to a grid environment, that is, to grid-enable R. We will give an introduction
to distributed data analysis and data exchange in the context of R and a detailed description
of the GridR package. Then, we will show a real-world example of distributed data analysis
using R, referring to a scenario from the clinico-genomic area in the context of the ACGT
project.
The goal of the tutorial is to make the attendees familiar with the principles of distributed
computing and to discuss relevant R packages (such as GridR) that provide access to distributed
computing environments. Attendees will learn how to make use of a distributed environment
from their local programs.
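To give a flavour of what using distributed resources from a local program can look like, the following minimal sketch sends a function from a local R session to two worker processes and collects the results. It uses the `parallel` package that ships with R, not GridR, and it makes no claim about the GridR interface. The function `boot_mean` and the data vector `x` are hypothetical stand-ins for a real analysis step.

```r
library(parallel)  # part of the standard R distribution

# Hypothetical stand-in for an expensive analysis step:
# the mean of one bootstrap resample of x, made reproducible by a seed.
boot_mean <- function(seed, x) {
  set.seed(seed)
  mean(sample(x, length(x), replace = TRUE))
}

# Illustrative data only
x <- c(2.1, 3.4, 1.9, 5.6, 4.2, 3.3)

# Start two worker processes on the local machine
cl <- makeCluster(2)

# Ship the function and the data to the workers, one task per seed
parallel_result <- parLapply(cl, 1:8, boot_mean, x = x)

# Always release the workers when done
stopCluster(cl)

# The same computation run sequentially, for comparison
serial_result <- lapply(1:8, boot_mean, x = x)
```

Because each task sets its own seed, the parallel and the sequential run are expected to produce the same values. In a cluster or grid setting the workers would run on remote machines instead of the local one, but the pattern of shipping a function plus its data and collecting the results stays the same.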
Audience
- People who are interested in using R in a distributed (e.g., grid or cluster computing) environment
- People who are interested in the parallelization of R programs
- Biostatisticians who want to learn how to adapt a scenario to GridR
Required Knowledge
- Basic knowledge of R is expected
- No knowledge of cluster or grid computing or the underlying technologies is assumed
- No knowledge of clinico-genomic data analysis is assumed
Outline
- General part
  - Introduction to distributed data analysis
  - Overview of R packages for distributed analysis
  - Standards and Data Formats for distributed systems
  - Web-Services and Cluster-Based Computing
  - Overview of R packages for data exchange
- The GridR package
  - Parallelization of computational tasks
  - Collaboration among multiple users
- Real-world example
  - The ACGT project
  - Grid-based data analysis of clinico-genomic data
The tutorial will contain practical exercises that give the participants the opportunity to gain
first-hand experience of distributed data analysis, and even to try out
how to distribute their own programs.
Organizers
- Dr. Stefan Rüping is leader of the working group on integrated data mining at the
Fraunhofer Institute for Intelligent Analysis and Information Systems in
Sankt Augustin, Germany. He studied computer science and received his doctoral
degree from the University of Dortmund in 2006. His research interests include
distributed knowledge discovery and large-scale data mining. He has been an
enthusiastic R user for years.
- PD Dr. Michael Mock studied Computer Science at the University of Bonn from
1982 to 1987, receiving his Diploma degree in 1987. In 1985, he joined the GMD -
German National Research Center for Information Technology, where he has led
several research projects in cooperation with major European industrial IT providers
(Siemens, Philips, Dassault-Electronique). He received his doctoral degree from the
University of Bonn in 1995 and his Habilitation degree from the University of
Magdeburg in 2004, and has been a member of its Department of Computer Science since
then. Dr. Mock is currently working as a senior scientist at the Fraunhofer Institute for
Intelligent Analysis and Information Systems in the department KD (Knowledge
Discovery). His research interests include distributed systems, integrated data-mining
systems and real-time systems. He has lectured at the University of Magdeburg,
the University of Bonn, and the Universities of Applied Sciences Cologne and
Bonn-Rhein-Sieg. He has authored over 60 publications, including textbooks on programming and
wireless networks. He is a member of the GI (Gesellschaft für Informatik e.V.) and its
special interest group on operating systems.
- Dennis Wegener studied Computer Science at the University of Bonn from 2001
to 2006, receiving his Diploma degree in 2006. Since 2006 he has worked as a research
fellow at the Fraunhofer Institute for Intelligent Analysis and Information Systems
IAIS, Department Knowledge Discovery, in Sankt Augustin, Germany. His research
interests include data mining and grid computing. He was one of the main developers
of the DataMiningGrid system and is currently involved as a key person in several
European projects on data mining and grid computing, e.g. ACGT and Simdat.
References
- Dennis Wegener, Thierry Sengstag, Stelios Sfakianakis, Stefan Rüping and Anthony
Assi. GridR: An R-based grid-enabled tool for data analysis in ACGT clinico-genomic
trials. In: Proceedings of the 3rd International Conference on e-Science and Grid
Computing (eScience 2007), Bangalore, India.
- Stefan Rüping, Stelios Sfakianakis and Manolis Tsiknakis. Extending Workflow
Management for Knowledge Discovery in Clinico-Genomic Data. In: From Genes to
Personalized HealthCare: Grid Solutions for the Life Sciences, Proceedings of
HealthGrid 2007, pp. 183-193, IOS Press, 2007.
- Vlado Stankovski, Martin Swain, Valentin Kravtsov, Thomas Niessen, Dennis
Wegener, Joerg Kindermann, and Werner Dubitzky. Grid-enabling data mining
applications with DataMiningGrid: An architectural perspective. Future Generation
Computer Systems Journal, 2007.
- Vlado Stankovski, Martin Swain, Valentin Kravtsov, Thomas Niessen, Dennis
Wegener, Matthias Röhm, Jerney Trnkoczy, Michael May, Jürgen Franke, Assaf
Schuster and Werner Dubitzky. Digging Deep into the Data Mine with
DataMiningGrid. IEEE Internet Computing, accepted for publication in 2007.
- Dennis Wegener and Michael May. Extensibility of Grid-Enabled Data Mining
Platforms: A Case Study. In: Proceedings of the 5th International Workshop on Data Mining
Standards, Services and Platforms, KDD 2007, pp. 13-22, San Jose, USA, August
2007. ISBN 978-1-59593-838-1.