Tutorial: Distributed Data Analysis using R

Stefan Rüping, Fraunhofer IAIS, Sankt Augustin, Germany.
Michael Mock, Fraunhofer IAIS, Sankt Augustin, Germany.
Dennis Wegener, Fraunhofer IAIS, Sankt Augustin, Germany.

Slides available

Follow this link.

Abstract

Over the last few years, the amount of data to be analysed in many areas has grown rapidly. Examples range from the natural sciences (e.g. astronomy or particle physics) and business data (e.g. the large increase in data volume expected from the use of RFID technology) to the life sciences (such as high-throughput genomics and post-genomics technologies) and data generated by ordinary users on the internet (see Google, YouTube, etc.). This enormous growth is complemented by advances in distributed computing technology that enable the data analyst to handle such volumes of data in reasonable time. Two main streams of current distributed technology development and research are particularly useful in this respect. GRID technology aims at making geographically widely spread data stores and computing facilities available for a common, global data analysis. The other stream is cluster-based computing, which turns large numbers of standard computers into high-performance computing platforms.

However, although these advances in distributed computing technology make available the computing and storage resources needed to handle large amounts of data, they also introduce another level of complexity into the system. The traditional data analyst, with a strong background in statistics and application domain knowledge, might be overwhelmed by the complexity of the underlying distributed technology. For instance, an application developer using R might not be interested in any details of how web services are built. Ongoing research therefore aims at bridging the gap between advanced distributed computing technology and traditional statistical software.

The goal of this tutorial is to inform statisticians, especially those using the R language, about current trends in distributed computing technology and to show how to use and integrate R programs in distributed environments, considering both GRID and cluster-based computing. As a particularly challenging example, we will, among others, report on the Advancing Clinico-Genomic Trials on Cancer (ACGT) project, which aims at providing a data analysis environment that allows the exploitation of an enormous pool of data collected in European cancer treatments.

In the context of this project, the GridR package was developed, which was one of the first attempts to connect R to a grid environment, that is, to grid-enable R. We will give an introduction to distributed data analysis and data exchange in the context of R and a detailed description of the GridR package. Then we will show a real-world example of distributed data analysis using R, referring to a scenario from the clinico-genomic area in the context of the ACGT project.

The goal of the tutorial is to familiarise attendees with the principles of distributed computing and to discuss relevant R packages (such as GridR) that provide access to distributed computing environments. Attendees will learn how to make use of a distributed environment from their local programs.

Audience

People who are interested in using R in a distributed (e.g., grid or cluster computing) environment
People who are interested in the parallelization of R programs
Biostatisticians who want to learn how to adapt a scenario to GridR

Required Knowledge

basic knowledge of R is expected
no knowledge of cluster or grid computing, or of the underlying technologies, is assumed
no knowledge of clinico-genomic data analysis is assumed

Outline

General part
- Introduction to distributed data analysis
- Overview of R packages for distributed analysis
- Standards and Data Formats for distributed systems
- Web-Services and Cluster-Based Computing
- Overview of R packages for data exchange
The GridR package
- Parallelization of computational tasks
- Collaboration among multiple users
Real-world example
- the ACGT project
- Grid-based data analysis of clinico-genomic data

The tutorial will contain practical exercises that give participants firsthand experience of distributed data analysis and the opportunity to try distributing their own programs.

Organizers

Dr. Stefan Rüping is the leader of the working group on integrated data mining at the Fraunhofer Institute for Intelligent Analysis and Information Systems in Sankt Augustin, Germany. He studied computer science and received his doctoral degree from the University of Dortmund in 2006. His research interests include distributed knowledge discovery and large-scale data mining. He has been an enthusiastic R user for years.
PD Dr. Michael Mock studied Computer Science at the University of Bonn from 1982 to 1987, receiving his Diploma degree in 1987. In 1985, he joined the GMD - German National Research Center for Information Technology, where he has led several research projects cooperating with main European industrial IT providers (Siemens, Philips, Dassault-Electronique). He received his doctoral degree from the University of Bonn in 1995 and his Habilitation degree from the University of Magdeburg in 2004, and has been a member of its Department of Computer Science since then. Dr. Mock is currently working as a senior scientist at the Fraunhofer Institute for Intelligent Analysis and Information Systems in the department KD (Knowledge Discovery). His research interests include distributed systems, integrated data-mining systems and real-time systems. He has lectured at the University of Magdeburg, the University of Bonn, and the Universities of Applied Sciences Cologne and Bonn-Rhein-Sieg. He has authored over 60 publications, including textbooks on programming and wireless networks. He is a member of the GI (Gesellschaft für Informatik e.V.) and its special interest group on operating systems.
Dennis Wegener studied Computer Science at the University of Bonn from 2001 to 2006, receiving his Diploma degree in 2006. Since 2006 he has worked as a research fellow at the Fraunhofer Institute for Intelligent Analysis and Information Systems IAIS, Department Knowledge Discovery, in Sankt Augustin, Germany. His research interests include data mining and grid computing. He was one of the main developers of the DataMiningGrid system and is currently involved as a key person in several European projects on data mining and grid computing, e.g. ACGT and Simdat.

References

Dennis Wegener, Thierry Sengstag, Stelios Sfakianakis, Stefan Rüping and Anthony Assi. GridR: An R-based grid-enabled tool for data analysis in ACGT clinico-genomic trials. In: Proceedings of the 3rd International Conference on e-Science and Grid Computing (eScience 2007), Bangalore, India.
Stefan Rüping, Stelios Sfakianakis and Manolis Tsiknakis. Extending Workflow Management for Knowledge Discovery in Clinico-Genomic Data. In: From Genes to Personalized HealthCare: Grid Solutions for the Life Sciences, Proceedings of HealthGrid 2007, pp. 183-193, IOS Press, 2007.
Vlado Stankovski, Martin Swain, Valentin Kravtsov, Thomas Niessen, Dennis Wegener, Joerg Kindermann, and Werner Dubitzky. Grid-enabling data mining applications with DataMiningGrid: An architectural perspective. Future Generation Computer Systems Journal, 2007.
Vlado Stankovski, Martin Swain, Valentin Kravtsov, Thomas Niessen, Dennis Wegener, Matthias Röhm, Jernej Trnkoczy, Michael May, Jürgen Franke, Assaf Schuster and Werner Dubitzky. Digging Deep into the Data Mine with DataMiningGrid. IEEE Internet Computing, accepted for publishing in 2007.
Dennis Wegener and Michael May. Extensibility of Grid-Enabled Data Mining Platforms: A Case Study. In: Proceedings of the 5th International Workshop on Data Mining Standards, Services and Platforms, KDD 2007, pp. 13-22, San Jose, USA, August 2007. ISBN 978-1-59593-838-1.