Talk Schedule

The talks will take place on 11–13 July 2018 (click a talk title to view its abstract). A datatable version is provided here if you’re looking for an easier-to-search, R-oriented format. Information for presenters is here.

Time Session Presenter Venue Title Keywords Chair Slides
13:45 Keynote Steph de Silva AUD Beyond syntax, towards culture: the potentiality of deep open source communities NA Di Cook click here
For over twenty years, R has been a programming language under development. In that time a collection of open source communities have sprung up around it. These communities have commonalities that are developing into a distinct programming subculture. The existence of a common subculture connecting these communities is important for two reasons: the power to create value and the potential to champion values.
15:30 Applications in society Richard Layton P8 Data, methods, and metrics for studying student persistence applications, community/education, persistence metrics, intersectionality, longitudinal Jessie Roberts click here
This paper introduces R users to data and tools for investigating undergraduate persistence metrics using the midfieldr package and its associated data packages. The data are the student records (registrar's data) of approximately 200,000 undergraduates at US institutions from 1990 to 2016. midfieldr provides functions for determining persistence metrics such as graduation rates and for grouping and displaying findings by program, institution, race/ethnicity, and sex. These packages provide an entry to this type of intersectional research for anyone with basic proficiency in R and familiarity with packages from the tidyverse. The goal of the paper is to introduce the packages and to share our data, methods, and metrics for intersectional research in student persistence.
15:50 Applications in society Maria Holcekova P8 The dynamic approach to inequality: Using longitudinal trajectories of young women and their parents in determining their socio-economic positions within the contemporary Western society visualisation, clustering, imputation, longitudinal data analysis Jessie Roberts NA
Intensified globalisation and the ensuing increased affluence of Western populations have changed the composition of the traditional social class system in England. This does not imply the disappearance of socio-economic (SE) classes and inequalities, but rather their redefinition. Unfortunately, limited research has considered the dynamic nature of SE positions, especially in understanding youth transitions from parental to personal SE classes. I address this problem using nationally representative longitudinal data from the Next Steps 1990 youth cohort study in England. Firstly, I explore the parental transition patterns using longCatPlot. Secondly, I visualise missing data with the missmap function in Amelia and impute these values using random forests in missForest. Thirdly, I employ the daisy function in the cluster package to create SE groups based on Gower distance, partitioning around medoids, and silhouette width. Finally, I visualise these results using ggplot2. In doing so, I establish five distinct SE groups of young women, which contributes to the understanding of new forms of inequality, and I discuss the implications in terms of access to educational and labour market resources.
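The pipeline described above maps onto a handful of well-known functions; a minimal sketch, assuming a data frame `ns` of mixed categorical and numeric socio-economic indicators with missing values (the object name and the choice of five clusters are illustrative, not taken from the study):

```r
library(Amelia)      # missmap()
library(missForest)  # random-forest imputation
library(cluster)     # daisy(), pam(), silhouette widths
library(ggplot2)

missmap(ns)                       # visualise the missingness pattern
ns_imp <- missForest(ns)$ximp     # impute with random forests

# Gower distance copes with mixed numeric/categorical variables
gower_d <- daisy(ns_imp, metric = "gower")

# Partition around medoids; average silhouette width helps choose k
fit <- pam(gower_d, k = 5, diss = TRUE)
fit$silinfo$avg.width

ns_imp$cluster <- factor(fit$clustering)
ggplot(ns_imp, aes(cluster)) + geom_bar()
```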
16:10 Applications in society Frank C.S. Liu P8 The Second Wing of Polls: How Multiple Correspondence Analysis using R Advances Exploring Associated Attitudes in Smaller-Data applications Jessie Roberts NA
Polls and surveys have long been used to forecast voter preferences and understand consumer behavior. Academically, we employ the strength of these smaller but representative data to confirm theory, including identifying associations between theoretically motivated variables. However, researchers who want to explore new patterns to better understand voters' behavior and attitudes are hardly satisfied by the current practice of survey data analysis. While we turn to bigger data, little attention has been paid to the value of such smaller data and their potential to achieve the same goal. This talk will demonstrate how the FactoMineR package assists in exploring associated concepts, attitudes, and patterns that could not be identified by theory in the first place. Implications for the practice of survey data collection and MCA's connection to association rule mining will be discussed.
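For orientation, a bare-bones MCA call looks like this (`survey`, a data frame of categorical survey items, is a hypothetical stand-in for the poll data):

```r
library(FactoMineR)

res <- MCA(survey, graph = FALSE)

head(res$eig)                 # variance explained by each dimension

# Factor map of response categories with individuals hidden:
# categories plotted close together suggest associated attitudes
plot(res, invisible = "ind")
```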
16:30 Applications in society Meryam Krit P8 Modelling Rift Valley Fever models, applications, community/education Jessie Roberts click here
Rift Valley Fever (RVF) is one of the major viral zoonoses in Africa, affecting man and several domestic animal species. The epidemics generally involve a 5–15 year cycle marked by abnormally high rainfall (the El Niño/Southern Oscillation (ENSO) phenomenon), but there is more and more evidence of inter-epidemic transmission. A flexible model describing RVF transmission dynamics in six species (human, domestic animal, four vectors) in three different areas will be presented. The model allows for migration, flooding, variation in climate, seasonal effects on vector egg hatching, transhumance, alternative wildlife hosts and increased susceptibility of animals. A user-friendly shiny interface and an optimized Rcpp implementation allow epidemiological researchers to study different scenarios and adapt the model to other situations. Application of the model to the specific situations in Tanzania and Algeria will be discussed.
15:30 Big data Miguel Gonzalez-Fierro P9 Spark on demand with AZTK big data Max Kuhn NA
Apache Spark has become the technology of choice for big data engineering. However, provisioning and maintaining Spark clusters can be challenging and expensive. To address this issue, Microsoft has developed the Azure Distributed Data Engineering Toolkit (AZTK). This talk describes how AZTK Spark clusters can be provisioned in the cloud from a local machine with just a few commands. The clusters are ready to use in under 5 minutes and come with R and RStudio Server pre-installed, allowing R users to start developing Spark applications immediately. Users can apply their own Docker image to customize the Spark environment. AZTK clusters, composed of low-priority Azure virtual machines, can be created on demand and run only as needed, allowing for large cost savings. We will show a short demo of how the pre-installed sparklyr package can be used to perform data engineering tasks using dplyr syntax, and machine learning using the Spark MLlib library.
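Once a cluster exists, the sparklyr side of the workflow is standard; a sketch, where the master URL is a placeholder for whatever address the provisioned AZTK cluster reports:

```r
library(sparklyr)
library(dplyr)

# Connect to the Spark master of an existing cluster (placeholder address)
sc <- spark_connect(master = "spark://<cluster-master>:7077")

# Copy a local data frame into Spark and work with it using dplyr verbs
cars_tbl <- copy_to(sc, mtcars, "cars")
cars_tbl %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE)) %>%
  collect()

# Machine learning with Spark MLlib through the same connection
ml_linear_regression(cars_tbl, mpg ~ wt + hp)

spark_disconnect(sc)
```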
15:50 Big data Benjamin Ortiz Ulloa P9 Graphs: Datastructures to Query algorithms, models, databases, networks, text analysis/NLP, big data Max Kuhn NA
When people think of graphs, they often think about mapping out social media connections. While graphs are indeed useful for mapping out social networks, they have many other practical applications. Data in the real world resemble vertices and edges more than they resemble rows and columns, which allows researchers to intuitively grasp the data modeled and stored within a graph. Graph exploration -- also known as graph traversal -- is traditionally done with a traversal language such as Gremlin or Cypher. The functionality of these traversal languages can be duplicated by combining the igraph and magrittr packages. Traversing a graph in R gives useRs access to a myriad of simple but powerful algorithms to explore their data sets. This talk will show why data should be explored as a graph and how a graph can be traversed in R. I will do this by surveying different graph traversal techniques and showing the code patterns needed for each of them.
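A small sketch of the pipe-based traversal style the talk describes, using a toy edge list invented for illustration:

```r
library(igraph)
library(magrittr)

# Toy directed graph: who follows whom
edges <- data.frame(
  from = c("ann", "ann", "bob", "cat"),
  to   = c("bob", "cat", "dan", "dan")
)
g <- graph_from_data_frame(edges, directed = TRUE)

# One-hop traversal: out-neighbours of "ann"
g %>% neighbors("ann", mode = "out")

# Two-hop traversal, roughly what a Gremlin/Cypher query would express:
# everything reachable from "ann" within two steps
reachable <- g %>% ego(order = 2, nodes = "ann", mode = "out")
reachable[[1]]$name
```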
16:10 Big data Amy Stringer P9 Automated Visualisations for Big Data visualisation, reproducibility, big data Max Kuhn NA
The Catlin Seaview project is a large scale reef survey for estimating coral cover at various locations around the world. Upon re-surveying, it is possible to track changes in, and predict the future condition of, these reefs over time. The survey collects hundreds of thousands of images from 2 km transects of reef, which are then sent to a neural network for automatic annotation of reef communities. Annotations are completed in such a way that the resulting data have hierarchical spatial scales, going up from image, to transect, to reef, to subregion, to region. Here, we present an efficient method for extracting, summarising and visualising the big and complex data with Rmarkdown, dplyr and ggplot2. The use of Rmarkdown for report generation allows for the introduction of parameters into the construction of the document, allowing for entirely unique reports to be developed from the one source script. This approach has resulted in a system for compiling 22 reproducible reports, extracting, summarising and visualising data at multiple spatial scales, from over 600 000 images, in a matter of minutes; leaving machines to do the work so that people have time to think.
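The parameterised-report mechanism is standard rmarkdown; a minimal, hypothetical version (the file name, parameter name and region list are invented):

```r
library(rmarkdown)
library(purrr)

# report.Rmd declares a `region` parameter in its YAML header, e.g.
#   params:
#     region: "default"
# and filters/plots its data with dplyr and ggplot2 using params$region.

regions <- c("north", "central", "south")

# Render one stand-alone report per region from the single source script
walk(regions, function(r) {
  render(
    "report.Rmd",
    params      = list(region = r),
    output_file = paste0("report_", r, ".html")
  )
})
```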
16:30 Big data Snehalata Huzurbazar P9 Visualizations to guide dimension reduction for sparse high-dimensional data visualisation Max Kuhn NA
Dimension reduction for high-dimensional data is necessary for descriptive data analysis. Most researchers restrict themselves to visualizing 2 or 3 dimensions; however, to understand relationships between many variables in high-dimensional data, more dimensions are needed. This talk presents several new options for visualizing beyond 3D, illustrated using 16S rRNA microbiome data. We will show intensity plots developed to highlight the changing contributions of taxa (or subjects) as the number of principal components of the dimension reduction or ordination method is changed. We also revive Andrews curves, connected with a tour algorithm for viewing 1D projections of multiple principal components, to study group behavior in the high-dimensional data. The plots provide a quick visualization of taxa/subjects that are close to the `center' or that contribute to dissimilarity. They also allow for exploration of patterns among related subjects or taxa not seen in other visualizations. All code is written in R and available on GitHub.
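Andrews curves themselves are straightforward to compute; the base-R sketch below draws them for principal component scores (iris stands in for the microbiome data, and this is not the authors' intensity-plot or tour code):

```r
# Andrews curve for one observation x over t in [-pi, pi]:
# f(t) = x1/sqrt(2) + x2*sin(t) + x3*cos(t) + x4*sin(2t) + ...
andrews_curve <- function(x, t) {
  f <- x[1] / sqrt(2)
  for (j in seq_along(x)[-1]) {
    k <- ceiling((j - 1) / 2)
    f <- f + if (j %% 2 == 0) x[j] * sin(k * t) else x[j] * cos(k * t)
  }
  f
}

# PCA scores as input; iris stands in for the high-dimensional data
pcs <- prcomp(iris[, 1:4], scale. = TRUE)$x[, 1:4]
t   <- seq(-pi, pi, length.out = 200)

curves <- apply(pcs, 1, andrews_curve, t = t)
matplot(t, curves, type = "l", lty = 1,
        col = as.integer(iris$Species),
        xlab = "t", ylab = "f(t)", main = "Andrews curves of PC scores")
```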
15:30 Stat methods for high-dim biology Florian Rohart P10 mixOmics: An R package for 'omics feature selection and multiple data integration data mining, applications, bioinformatics, multivariate, big data Julie Josse NA
The mixOmics R package contains a suite of multivariate methods that model molecular features holistically and statistically integrate diverse types of data (e.g. ‘omics data such as transcriptomics, proteomics and metabolomics) to offer an insightful picture of a biological system. Our two latest frameworks for data integration are N-integration with DIABLO, which combines different ‘omics datasets measured on the same N samples or individuals, and P-integration with MINT, which combines studies measured on the same P features (e.g., genes) but from independent cohorts of individuals. Both frameworks are introduced in a discriminative context for the identification of relevant and robust molecular signatures across multiple data sets. mixOmics is a well-designed, user-friendly package with attractive graphical outputs. It represents a significant contribution to the field of computational biology, which has a strong need for such toolkits to mine and integrate datasets.
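A schematic of an N-integration (DIABLO) call with made-up object names: `X` is a named list of omics matrices measured on the same samples, `Y` the class factor, and `keepX` sets how many features are selected per block and component:

```r
library(mixOmics)

X <- list(mRNA    = mrna_matrix,      # hypothetical matrices, same N samples
          protein = protein_matrix,
          metab   = metab_matrix)
keepX <- list(mRNA = c(20, 10), protein = c(20, 10), metab = c(20, 10))

diablo_fit <- block.splsda(X = X, Y = Y, ncomp = 2, keepX = keepX)

plotIndiv(diablo_fit)                 # sample plot per block
selectVar(diablo_fit, comp = 1)       # features selected on component 1
```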
15:50 Stat methods for high-dim biology Claus Ekstrøm P10 Using mommix for fast, large-scale genome-studies in the presence of gene-environment and gene-gene interaction algorithms, models, bioinformatics, big data Julie Josse NA
The majority of disorders and outcomes analysed in genome-wide association studies are believed to be multi-factorial and influenced by gene-environment (GxE) interactions, gene-gene (GxG) interactions, or both. However, including GxE or GxG interactions increases the computational burden by several orders of magnitude, which makes their inclusion prohibitively cumbersome. Finite mixtures of regression models provide a flexible framework for modelling many phenomena. Using moment-based estimation of the regression parameters, we develop unbiased estimators with a minimum of assumptions on the mixture components. In particular, only the average regression model for one of the components in the mixture model is needed, and there are no requirements on any of the distributions. We present a new R package, mommix, for moment-based mixtures of regression models, which implements this new approach to regression mixtures. We illustrate the use of moment-based mixtures of regression models with an application to genome-wide association analysis, and show that the implementation is fast, making large-scale genetic analysis with gene-environment and gene-gene interactions feasible.
16:10 Stat methods for high-dim biology Jacob Bergstedt P10 Quantifying the immune system with the MMI package models, data mining, applications, reproducibility, bioinformatics, interfaces Julie Josse NA
The blood composition of immune cells provides a key indicator of human health and disease. To identify the sources of variation in this composition, we combined standardized flow cytometry with a questionnaire investigating demographic factors in 816 French individuals. The study is published in the Nature Immunology article “Natural variation in innate immune cell parameters is preferentially driven by genetic factors”. To facilitate the study, we developed the R package MMI (https://github.com/jacobbergstedt/mmi), which defines a framework for specifying a family of models. Operations are implemented for models in the family, such as performing tests, computing confidence intervals or AIC measures, and investigating residuals, the results of which are collected in a MapReduce-like pattern. The software keeps track of variables, parameter transformations, multiple testing and selective inference adjustments. With the package we release the dataset of 816 observations of 166 immune cell parameters and 44 demographic variables. We hope that this resource can be used to generate hypotheses in immunology, but also be of benefit to the broader community, in education and benchmarking.
16:30 Stat methods for high-dim biology Rudradev Sengupta P10 High Performance Computing Using R for High Dimensional Surrogacy Applications in Drug Development models, data mining, applications, bioinformatics, performance, big data Julie Josse NA
Identification of genetic biomarkers is a primary data analysis task in the context of drug discovery experiments. These experiments consist of several high-dimensional datasets which contain information about a set of new drugs under development. This type of data structure introduces the challenge of multi-source data integration, which is needed in order to identify the biological pathways related to the new set of drugs under development. In order to process all the information contained in the datasets, high performance computing techniques are required. Currently available R packages for parallel computing are not optimized for this specific setting and data structure. We propose a new “master-slave” framework for parallel data analysis with R on a computer cluster. The proposed data analysis workflow is applied to a multi-source high-dimensional drug discovery dataset, and a performance comparison is made between the new framework and existing R packages for parallel computing. Different configuration settings for parallel programming in R are presented to show that the computation time, for the specific application under consideration, can be reduced by 534.62%.
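The authors' framework itself is not sketched here; for orientation only, the generic master-slave pattern it builds on can be written with base R's parallel package (synthetic data, not the drug discovery workflow):

```r
library(parallel)

# Master: start workers, export what they need, farm out tasks, collect results
cl <- makeCluster(4)

# Placeholder for a per-dataset biomarker analysis
analyse_chunk <- function(chunk) colMeans(chunk)

# 100 synthetic data chunks standing in for the high-dimensional sources
chunks <- replicate(100, matrix(rnorm(1000), ncol = 10), simplify = FALSE)

clusterExport(cl, "analyse_chunk")
results <- parLapply(cl, chunks, analyse_chunk)

stopCluster(cl)
```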
15:30 Robust methods Kasey Jones P6 rollmatch: An R Package for Rolling Entry Matching algorithms, models Adam Sparks NA
The gold standard of experimental research is the randomized controlled trial. However, many healthcare interventions are implemented without a randomized control group for practical or ethical reasons. Propensity score matching (PSM) is a popular method for approximating a randomized experiment from observational data by matching members of a treatment group to similar candidates in a control group that did not receive the intervention. However, traditional PSM is not designed for studies that enroll participants on a rolling basis, a common practice in healthcare interventions where delaying treatment may impact patient health. Rolling Entry Matching (REM) is a new matching method that addresses the rolling entry problem by selecting comparison group members who are similar to intervention members with respect to both static, unchanging characteristics (e.g., race, DOB) and dynamic characteristics that change over time (e.g., health conditions, health care use). This presentation will introduce both REM and rollmatch, an R package for performing REM to assess rolling entry interventions.
15:50 Robust methods Charles T. Gray P6 varameta': Meta-analysis of medians algorithms, models, applications, reproducibility Adam Sparks NA
Meta-analyses bring together summary statistics from multiple sources, which are reported in various ways. In this talk I will introduce the `varameta` package, which will provide an underlying (and reproducible) framework for understanding skewed meta-analysis data and reporting. The `varameta` package accompanies a couple of theoretical papers I am working on for the meta-analysis of medians. It is also designed to be an adjunct to the well-established `metafor` package. In this package I have collated the existing techniques for meta-analysing skewed data reported as medians and interquartile ranges (or ranges). The `varameta` package will also include reproducible simulation documentation (in .Rmd) of existing methods in meta-analysis, benchmarked against our proposed estimator for the standard error of the sample median. In this talk I will demonstrate the package and its web interface for clinicians, as well as how it can be used in everyday systematic reviews.
16:10 Robust methods Sevvandi Kandanaarachchi P6 Does normalizing your data affect outlier detection? algorithms, Data pre-processing Adam Sparks click here
It is common practice to normalize data before using an outlier detection method. But which method should we use to normalize the data? Does it matter? The short answer is yes, it does. The choice of normalization method may increase or decrease the effectiveness of an outlier detection method on a given dataset. In this talk we investigate this triangular relationship between datasets, normalization methods and outlier detection methods.
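A toy illustration of the effect, with an invented dataset and a simple nearest-neighbour distance as the outlier score (not the methods compared in the talk):

```r
set.seed(1)

# Two columns on very different scales, with one unusual point
x <- cbind(rnorm(100, sd = 1), rnorm(100, sd = 1000))
x[1, ] <- c(6, 0)   # outlying only in the first, small-scale column

# Simple outlier score: distance to the nearest neighbour
nn_score <- function(m) {
  d <- as.matrix(dist(m))
  diag(d) <- Inf
  apply(d, 1, min)
}

rank(-nn_score(x))[1]                      # raw data: outlier hidden
rank(-nn_score(scale(x)))[1]               # z-score normalisation
minmax <- apply(x, 2, function(col) (col - min(col)) / diff(range(col)))
rank(-nn_score(minmax))[1]                 # min-max normalisation
```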
16:30 Robust methods Priyanga Dilini Talagala P6 oddstream and stray: Anomaly Detection in Streaming Temporal Data with R algorithms, space/time, multivariate, streaming data, outlier detection Adam Sparks NA
This work introduces two R packages, oddstream and stray, for detecting anomalous series within a large collection of time series in the context of non-stationary streaming data. In `oddstream` we define an anomaly as an observation that is very unlikely given the recent distribution of a given system. This package provides a framework for early detection of anomalous behaviour within a large collection of streaming time series, including a novel approach that adapts to non-stationarity. In `stray` we define an anomaly as an observation that deviates markedly from the majority with a large distance gap. This package provides a framework to detect anomalies in high-dimensional data; the framework is then extended to identify anomalies in streaming temporal data. The proposed algorithms use time series features as inputs and approaches based on extreme value theory for the model building process. Using various synthetic and real datasets, we demonstrate the wide applicability and usefulness of our proposed frameworks. We show that the proposed algorithms work well in the presence of noisy non-stationary data within multiple classes of time series.
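A rough sketch of how the two packages are called on synthetic data, assuming the exported entry points are `stray::find_HDoutliers()` and `oddstream::find_odd_streams()`; consult the package documentation for the exact arguments:

```r
library(stray)
library(oddstream)

# stray: anomalies in high-dimensional (non-temporal) data
x <- rbind(matrix(rnorm(1000), ncol = 10),   # bulk of the data
           rep(8, 10))                       # one clearly deviating row
find_HDoutliers(x)

# oddstream: anomalous series within a collection of streaming time series
# (columns are series; a window of typical behaviour trains the model)
train <- matrix(rnorm(100 * 50), ncol = 50)
test  <- matrix(rnorm(100 * 50), ncol = 50)
test[, 1] <- test[, 1] + 10                  # one series drifts away
find_odd_streams(train, test)
```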
15:30 Reproducibility John Blischak AUD The workflowr R package: a framework for reproducible and collaborative data science reproducibility Scott Came click here
The workflowr R package helps scientists organize their research in a way that promotes effective project management, reproducibility, collaboration, and sharing of results. workflowr combines literate programming (knitr and rmarkdown) and version control (Git, via git2r) to generate a website containing time-stamped, versioned, and documented results. Any R user can quickly and easily adopt workflowr, which includes four key features: (1) workflowr automatically creates a directory structure for organizing data, code, and results; (2) workflowr uses the version control system Git to track different versions of the code and results without the user needing to understand Git syntax; (3) to support reproducibility, workflowr automatically includes code version information in webpages displaying results; and (4) workflowr facilitates online web hosting (e.g. GitHub Pages) to share results. Our goal is that workflowr will make it easier for scientists to organize and communicate reproducible research results. Documentation and source code are available at https://github.com/jdblischak/workflowr.
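The four features map onto a handful of exported functions; a minimal sketch (the project name and GitHub username are placeholders):

```r
library(workflowr)

# One-time setup: create the project skeleton and Git repository
wflow_start("myproject")

# Build the R Markdown analyses into a versioned website
wflow_build()

# Commit the source, rebuild, and commit the rendered results in one step
wflow_publish("analysis/*.Rmd", message = "Initial analyses")

# Check which analyses are up to date, then configure online hosting
wflow_status()
wflow_use_github("username")
```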
15:50 Reproducibility Peter Baker AUD Efficient data analysis and reporting: DRY workflows in R applications, reproducibility Scott Came NA
When analysing data for different projects, do you often find yourself repeating the same steps? Typically, these steps follow a familiar pattern of reading, cleaning, summarising, plotting and analysing data, then producing a report. To aid reproducibility, naive examples using Rmarkdown are often presented. However, I routinely employ a modular approach combining GNU Make, R, Rmarkdown and/or Sweave files tracked under git. This system helps to implement a don't-repeat-yourself (DRY) approach and scales up well as projects become more complex. To aid automation, I have developed generic R, Rmarkdown, Stata, SAS and other pattern rules for GNU Make, as well as R packages to generate a project skeleton consisting of initial directories, Makefiles, and R syntax files for basic data cleaning and summaries; move data files and documents to standard directories; use codebook information to specify factors and check data; and finally initialise and add these to a local git repository. Comparisons will be made with alternative approaches such as ProjectTemplate and drake. GNU Make pattern rules and R software are available at https://github.com/petebaker.
16:10 Reproducibility Filip Krikava AUD Automated unit test generation using genthat reproducibility, testing Scott Came click here
Your package has examples and vignettes of its overall functionality but no unit tests for individual functions. Writing those is no fun. Yet, when something goes wrong, unit tests are your best tool for quickly pinpointing errors. The genthat package can generate unit tests for you in the popular testthat format. Moreover, it can also be used to create reproductions when you find a bug in someone else's code: there, instead of generating passing test cases, it will generate the smallest, purposefully failing, one. Genthat does not magically create new tests out of the blue; instead, it extracts the smallest possible test fragments from existing code. It does that by recording the input arguments and return values of all functions called by clients of your package. The generated tests concentrate on single functions and test them independently of each other, so a failing test usually locates the error more precisely than a failing chunk of application code. Trying it out on a random set of 1500 CRAN packages, genthat managed to reproduce 80% of all function calls, increasing the unit test coverage from 19% to 54%. In this talk we present genthat and discuss testing R code.
16:30 Reproducibility Dan Wilson AUD Practical R Workflows reproducibility, workflow Scott Came click here
Learn how R can be used to create reproducible workflows for practical use in business. As analysts and data scientists we often need to repeat our work time and time again. Sometimes this will be the exact same task; other times it may be a slight variation for another client or stakeholder. This talk will demonstrate a real-world set of workflows established at The Data Collective, designed to reduce the amount of copy/paste type actions to a few function calls that get the repetitive actions out of the way, so you can focus on the important parts of your job. Find out how to overcome the challenges of a repeatable workflow and make your life easier.
15:30 Spatial data and modeling Matt Moores P7 bayesImageS: an R package for Bayesian image analysis algorithms, applications, space/time Dale Bryan-Brown NA
There are many approaches to Bayesian computation with intractable likelihoods, including the exchange algorithm, approximate Bayesian computation (ABC), thermodynamic integration, and composite likelihood. These approaches vary in accuracy as well as scalability for datasets of significant size. The Potts model is an example where such methods are required, due to its intractable normalising constant. This model is a type of Markov random field, which is commonly used for image segmentation. The dimension of its parameter space increases linearly with the number of pixels in the image, making this a challenging application for scalable Bayesian computation. My talk will introduce various algorithms in the context of the Potts model and describe their implementation in C++, using OpenMP for parallelism. I will also discuss the process of releasing this software as an open source R package on the CRAN repository.
15:50 Spatial data and modeling Jin Li P7 A new R package for spatial predictive modelling: spm models, data mining, reproducibility, space/time, performance, spatial predictive models; hybrid methods of geostatistics and machine learning; model selection and validation; predictive accuracy Dale Bryan-Brown click here
Accuracy of spatial predictions is crucial for evidence-informed environmental management and conservation. Improving the accuracy by identifying the most accurate predictive model is essential, but also challenging as the accuracy is affected by multiple factors. Recently developed hybrid methods of machine learning and geostatistics have shown their advantages in spatial predictive modelling in environmental sciences, with significantly improved predictive accuracy. An R package, ‘spm: Spatial Predictive Modelling’, has been developed to introduce these methods and has recently been released for R users. This presentation will briefly introduce spm, including: 1) spatial predictive methods, 2) new hybrid methods of geostatistical and machine learning methods, 3) assessment of predictive accuracy, 4) applications of spatial predictive models, and 5) relevant functions in spm. It will then demonstrate how to apply some functions in spm to relevant datasets and show the resultant improvements in predictive accuracy and modelling efficiency. Although spm is applied here to data from the environmental sciences, it can also be applied to data in other relevant disciplines.
16:10 Spatial data and modeling Daniel Fryer P7 rcosmo: Statistical Analysis of the Cosmic Microwave Background visualisation, databases, space/time, big data, new R package Dale Bryan-Brown NA
The Cosmic Microwave Background (CMB) is remnant electromagnetic radiation from the epoch of recombination. It is the most ancient and important source of data about the early universe and the key to unlocking the mysteries of the Big Bang and the structure of time and space. Spurred on by a wealth of satellite data, intensive investigations in the past few years have resulted in many physical and mathematical results characterising CMB radiation. It can be modelled as a realisation of a homogeneous Gaussian random field on the sphere. But what does any of this matter for statisticians if they cannot play with the CMB data in their favourite programming language? A new R package, rcosmo, provides easy access to the CMB data and various tools for exploring geometric and statistical properties of the CMB. This talk will be a quick introduction to rcosmo by one of its developers, followed by an invitation for discussion and suggestions. This research was supported under the Australian Research Council's Discovery Project DP160101366.
16:30 Spatial data and modeling Marek Rogala P7 Using deep learning on Satellite imagery to get a business edge visualisation, algorithms, models, applications, web app, Satellite data Dale Bryan-Brown NA
The talk is about new possibilities arising from applying deep learning to satellite imagery. Satellite data changes the game: it gives access to information not otherwise available to business and makes it possible to look back in time. Combined with deep learning techniques, it delivers unique insights that have never been available before. Using deep learning on satellite data can deliver insights no human can. Satellite data is huge and non-obvious. By being able to go back to an arbitrary point in history, we can prevent fraud. We can build forecasts and observe events we wouldn't have access to otherwise. We'll explore a number of emerging use cases and the common traits behind them. I will show how our R department is working with satellite data and how we use Shiny to build decision support systems for business. As an example of my previous talks, here is a link to my talk at useR! Brussels 2017: https://channel9.msdn.com/Events/useR-international-R-User-conferences/useR-International-R-User-2017-Conference/shinycollections-Google-Docs-like-live-collaboration-in-Shiny