Talk Schedule

The talks will take place on 11–13 July 2018 (click a talk title to view its abstract). A datatable version is provided here if you’re looking for an easier-to-search, R-oriented format. Information for presenters is here.

Time Session Presenter Venue Title Keywords Chair Slides
13:45 Keynote Steph de Silva AUD Beyond syntax, towards culture: the potentiality of deep open source communities NA Di Cook click here
For over twenty years, R has been a programming language under development. In that time a collection of open source communities have sprung up around it. These communities have commonalities that are developing into a distinct programming subculture. The existence of a common subculture connecting these communities is important for two reasons: the power to create value and the potential to champion values.
15:30 Applications in society Richard Layton P8 Data, methods, and metrics for studying student persistence applications, community/education, persistence metrics, intersectionality, longitudinal Jessie Roberts click here
This paper introduces R users to data and tools for investigating undergraduate persistence metrics using the midfieldr package and its associated data packages. The data are the student records (registrar's data) of approximately 200,000 undergraduates at US institutions from 1990 to 2016. midfieldr provides functions for determining persistence metrics such as graduation rates and for grouping and displaying findings by program, institution, race/ethnicity, and sex. These packages provide an entry to this type of intersectional research for anyone with basic proficiency in R and familiarity with packages from the tidyverse. The goal of the paper is to introduce the packages and to share our data, methods, and metrics for intersectional research in student persistence.
15:50 Applications in society Maria Holcekova P8 The dynamic approach to inequality: Using longitudinal trajectories of young women and their parents in determining their socio-economic positions within the contemporary Western society visualisation, clustering, imputation, longitudinal data analysis Jessie Roberts NA
Intensified globalisation and the ensuing increased affluence of Western populations have changed the composition of the traditional social class system in England. This does not imply the disappearance of socio-economic (SE) classes and inequalities, but rather their redefinition. Unfortunately, limited research has considered the dynamic nature of SE positions, especially in understanding youth transitions from parental to personal SE classes. I address this problem using nationally representative longitudinal data from the Next Steps 1990 youth cohort study in England. Firstly, I explore the parental transition patterns using longCatPlot. Secondly, I visualise missing data with the missmap function in Amelia and impute these values using random forests in missForest. Thirdly, I employ the daisy function in the cluster package to create SE groups based on Gower distance, partitioning around medoids, and silhouette width. Finally, I visualise these results using ggplot2. In doing so, I establish five distinct SE groups of young women, which contributes to the understanding of new forms of inequality, and I discuss the implications in terms of access to educational and labour market resources.
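The pipeline described above maps onto a handful of well-known functions; a minimal sketch, assuming a data frame `ns` of mixed categorical and numeric socio-economic indicators with missing values (the object name and the choice of five clusters are illustrative, not taken from the study):

```r
library(Amelia)      # missmap()
library(missForest)  # random-forest imputation
library(cluster)     # daisy(), pam(), silhouette widths
library(ggplot2)

missmap(ns)                       # visualise the missingness pattern
ns_imp <- missForest(ns)$ximp     # impute with random forests

# Gower distance copes with mixed numeric/categorical variables
gower_d <- daisy(ns_imp, metric = "gower")

# Partition around medoids; average silhouette width helps choose k
fit <- pam(gower_d, k = 5, diss = TRUE)
fit$silinfo$avg.width

ns_imp$cluster <- factor(fit$clustering)
ggplot(ns_imp, aes(cluster)) + geom_bar()
```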
16:10 Applications in society Frank C.S. Liu P8 The Second Wing of Polls: How Multiple Correspondence Analysis using R Advances Exploring Associated Attitudes in Smaller-Data applications Jessie Roberts NA
Polls and surveys have long been used to forecast voter preferences and understand consumer behavior. Academically, we employ the strength of these smaller but representative data to confirm theory, including identifying associations between theoretically motivated variables. However, researchers who want to explore new patterns to better understand voters' behavior and attitudes are hardly satisfied by the current practice of survey data analysis. While we turn to bigger data, little attention has been paid to the value of such smaller data and their potential to achieve the same goal. This talk will demonstrate how the FactoMineR package assists in exploring associated concepts, attitudes, and patterns that could not be identified by theory in the first place. Implications for the practice of survey data collection and MCA's connection to association rule mining will be discussed.
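For orientation, a bare-bones MCA call looks like this (`survey`, a data frame of categorical survey items, is a hypothetical stand-in for the poll data):

```r
library(FactoMineR)

res <- MCA(survey, graph = FALSE)

head(res$eig)                 # variance explained by each dimension

# Factor map of response categories with individuals hidden:
# categories plotted close together suggest associated attitudes
plot(res, invisible = "ind")
```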
16:30 Applications in society Meryam Krit P8 Modelling Rift Valley Fever models, applications, community/education Jessie Roberts click here
Rift Valley Fever (RVF) is one of the major viral zoonoses in Africa, affecting man and several domestic animal species. The epidemics generally involve a 5–15 year cycle marked by abnormally high rainfall (the El Niño/Southern Oscillation (ENSO) phenomenon), but there is more and more evidence of inter-epidemic transmission. A flexible model describing RVF transmission dynamics in six species (human, domestic animal, four vectors) in three different areas will be presented. The model allows for migration, flooding, variation in climate, seasonal effects on vector egg hatching, transhumance, alternative wildlife hosts and increased susceptibility of animals. A user-friendly shiny interface and an optimized Rcpp implementation allow epidemiological researchers to study different scenarios and adapt the model to other situations. Application of the model to the specific situations in Tanzania and Algeria will be discussed.
15:30 Big data Miguel Gonzalez-Fierro P9 Spark on demand with AZTK big data Max Kuhn NA
Apache Spark has become the technology of choice for big data engineering. However, provisioning and maintaining Spark clusters can be challenging and expensive. To address this issue, Microsoft has developed the Azure Distributed Data Engineering Toolkit (AZTK). This talk describes how AZTK Spark clusters can be provisioned in the cloud from a local machine with just a few commands. The clusters are ready to use in under 5 minutes and come with R and RStudio Server pre-installed, allowing R users to start developing Spark applications immediately. Users can apply their own Docker image to customize the Spark environment. AZTK clusters, composed of low-priority Azure virtual machines, can be created on demand and run only as needed, allowing for large cost savings. We will show a short demo of how the pre-installed sparklyr package can be used to perform data engineering tasks using dplyr syntax, and machine learning using the Spark MLlib library.
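Once a cluster exists, the sparklyr side of the workflow is standard; a sketch, where the master URL is a placeholder for whatever address the provisioned AZTK cluster reports:

```r
library(sparklyr)
library(dplyr)

# Connect to the Spark master of an existing cluster (placeholder address)
sc <- spark_connect(master = "spark://<cluster-master>:7077")

# Copy a local data frame into Spark and work with it using dplyr verbs
cars_tbl <- copy_to(sc, mtcars, "cars")
cars_tbl %>%
  group_by(cyl) %>%
  summarise(avg_mpg = mean(mpg, na.rm = TRUE)) %>%
  collect()

# Machine learning with Spark MLlib through the same connection
ml_linear_regression(cars_tbl, mpg ~ wt + hp)

spark_disconnect(sc)
```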
15:50 Big data Benjamin Ortiz Ulloa P9 Graphs: Datastructures to Query algorithms, models, databases, networks, text analysis/NLP, big data Max Kuhn NA
When people think of graphs, they often think about mapping out social media connections. While graphs are indeed useful for mapping out social networks, they have many other practical applications. Data in the real world resemble vertices and edges more than they resemble rows and columns, which allows researchers to intuitively grasp the data modeled and stored within a graph. Graph exploration -- also known as graph traversal -- is traditionally done with a traversal language such as Gremlin or Cypher. The functionality of these traversal languages can be duplicated by combining the igraph and magrittr packages. Traversing a graph in R gives useRs access to a myriad of simple but powerful algorithms to explore their data sets. This talk will show why data should be explored as a graph and how a graph can be traversed in R. I will do this by surveying different graph traversal techniques and showing the code patterns needed for each of them.
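A small sketch of the pipe-based traversal style the talk describes, using a toy edge list invented for illustration:

```r
library(igraph)
library(magrittr)

# Toy directed graph: who follows whom
edges <- data.frame(
  from = c("ann", "ann", "bob", "cat"),
  to   = c("bob", "cat", "dan", "dan")
)
g <- graph_from_data_frame(edges, directed = TRUE)

# One-hop traversal: out-neighbours of "ann"
g %>% neighbors("ann", mode = "out")

# Two-hop traversal, roughly what a Gremlin/Cypher query would express:
# everything reachable from "ann" within two steps
reachable <- g %>% ego(order = 2, nodes = "ann", mode = "out")
reachable[[1]]$name
```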
16:10 Big data Amy Stringer P9 Automated Visualisations for Big Data visualisation, reproducibility, big data Max Kuhn NA
The Catlin Seaview project is a large scale reef survey for estimating coral cover at various locations around the world. Upon re-surveying, it is possible to track changes in, and predict the future condition of, these reefs over time. The survey collects hundreds of thousands of images from 2 km transects of reef, which are then sent to a neural network for automatic annotation of reef communities. Annotations are completed in such a way that the resulting data have hierarchical spatial scales, going up from image, to transect, to reef, to subregion, to region. Here, we present an efficient method for extracting, summarising and visualising the big and complex data with Rmarkdown, dplyr and ggplot2. The use of Rmarkdown for report generation allows for the introduction of parameters into the construction of the document, allowing for entirely unique reports to be developed from the one source script. This approach has resulted in a system for compiling 22 reproducible reports, extracting, summarising and visualising data at multiple spatial scales, from over 600 000 images, in a matter of minutes; leaving machines to do the work so that people have time to think.
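The parameterised-report mechanism is standard rmarkdown; a minimal, hypothetical version (the file name, parameter name and region list are invented):

```r
library(rmarkdown)
library(purrr)

# report.Rmd declares a `region` parameter in its YAML header, e.g.
#   params:
#     region: "default"
# and filters/plots its data with dplyr and ggplot2 using params$region.

regions <- c("north", "central", "south")

# Render one stand-alone report per region from the single source script
walk(regions, function(r) {
  render(
    "report.Rmd",
    params      = list(region = r),
    output_file = paste0("report_", r, ".html")
  )
})
```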
16:30 Big data Snehalata Huzurbazar P9 Visualizations to guide dimension reduction for sparse high-dimensional data visualisation Max Kuhn NA
Dimension reduction for high-dimensional data is necessary for descriptive data analysis. Most researchers restrict themselves to visualizing 2 or 3 dimensions; however, to understand relationships between many variables in high-dimensional data, more dimensions are needed. This talk presents several new options for visualizing beyond 3D, illustrated using 16S rRNA microbiome data. We will show intensity plots developed to highlight the changing contributions of taxa (or subjects) as the number of principal components of the dimension reduction or ordination method is changed. We also revive Andrews curves, connected with a tour algorithm for viewing 1D projections of multiple principal components, to study group behavior in the high-dimensional data. The plots provide a quick visualization of taxa/subjects that are close to the `center' or that contribute to dissimilarity. They also allow for exploration of patterns among related subjects or taxa not seen in other visualizations. All code is written in R and available on GitHub.
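Andrews curves themselves are straightforward to compute; the base-R sketch below draws them for principal component scores (iris stands in for the microbiome data, and this is not the authors' intensity-plot or tour code):

```r
# Andrews curve for one observation x over t in [-pi, pi]:
# f(t) = x1/sqrt(2) + x2*sin(t) + x3*cos(t) + x4*sin(2t) + ...
andrews_curve <- function(x, t) {
  f <- x[1] / sqrt(2)
  for (j in seq_along(x)[-1]) {
    k <- ceiling((j - 1) / 2)
    f <- f + if (j %% 2 == 0) x[j] * sin(k * t) else x[j] * cos(k * t)
  }
  f
}

# PCA scores as input; iris stands in for the high-dimensional data
pcs <- prcomp(iris[, 1:4], scale. = TRUE)$x[, 1:4]
t   <- seq(-pi, pi, length.out = 200)

curves <- apply(pcs, 1, andrews_curve, t = t)
matplot(t, curves, type = "l", lty = 1,
        col = as.integer(iris$Species),
        xlab = "t", ylab = "f(t)", main = "Andrews curves of PC scores")
```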
15:30 Stat methods for high-dim biology Florian Rohart P10 mixOmics: An R package for 'omics feature selection and multiple data integration data mining, applications, bioinformatics, multivariate, big data Julie Josse NA
The mixOmics R package contains a suite of multivariate methods that model molecular features holistically and statistically integrate diverse types of data (e.g. ‘omics data such as transcriptomics, proteomics and metabolomics) to offer an insightful picture of a biological system. Our two latest frameworks for data integration are N-integration with DIABLO, which combines different ‘omics datasets measured on the same N samples or individuals, and P-integration with MINT, which combines studies measured on the same P features (e.g., genes) but from independent cohorts of individuals. Both frameworks are introduced in a discriminative context for the identification of relevant and robust molecular signatures across multiple data sets. mixOmics is a well-designed, user-friendly package with attractive graphical outputs. It represents a significant contribution to the field of computational biology, which has a strong need for such toolkits to mine and integrate datasets.
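A schematic of an N-integration (DIABLO) call with made-up object names: `X` is a named list of omics matrices measured on the same samples, `Y` the class factor, and `keepX` sets how many features are selected per block and component:

```r
library(mixOmics)

X <- list(mRNA    = mrna_matrix,      # hypothetical matrices, same N samples
          protein = protein_matrix,
          metab   = metab_matrix)
keepX <- list(mRNA = c(20, 10), protein = c(20, 10), metab = c(20, 10))

diablo_fit <- block.splsda(X = X, Y = Y, ncomp = 2, keepX = keepX)

plotIndiv(diablo_fit)                 # sample plot per block
selectVar(diablo_fit, comp = 1)       # features selected on component 1
```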
15:50 Stat methods for high-dim biology Claus Ekstrøm P10 Using mommix for fast, large-scale genome-studies in the presence of gene-environment and gene-gene interaction algorithms, models, bioinformatics, big data Julie Josse NA
The majority of disorders and outcomes analysed in genome-wide association studies are believed to be multi-factorial and influenced by gene-environment (GxE) interactions, gene-gene (GxG) interactions, or both. However, including GxE or GxG interactions increases the computational burden by several orders of magnitude, which makes their inclusion prohibitively cumbersome. Finite mixtures of regression models provide a flexible framework for modelling many phenomena. Using moment-based estimation of the regression parameters, we develop unbiased estimators with a minimum of assumptions on the mixture components. In particular, only the average regression model for one of the components in the mixture model is needed, and there are no requirements on any of the distributions. We present a new R package, mommix, for moment-based mixtures of regression models, which implements this new approach to regression mixtures. We illustrate the use of moment-based mixtures of regression models with an application to genome-wide association analysis, and show that the implementation is fast, making large-scale genetic analysis with gene-environment and gene-gene interactions feasible.
16:10 Stat methods for high-dim biology Jacob Bergstedt P10 Quantifying the immune system with the MMI package models, data mining, applications, reproducibility, bioinformatics, interfaces Julie Josse NA
The blood composition of immune cells provides a key indicator of human health and disease. To identify the sources of variation in this composition, we combined standardized flow cytometry with a questionnaire investigating demographic factors in 816 French individuals. The study is published in the Nature Immunology article “Natural variation in innate immune cell parameters is preferentially driven by genetic factors”. To facilitate the study, we developed the R package MMI (https://github.com/jacobbergstedt/mmi), which defines a framework for specifying a family of models. Operations are implemented for models in the family, such as performing tests, computing confidence intervals or AIC measures, and investigating residuals, the results of which are collected in a MapReduce-like pattern. The software keeps track of variables, parameter transformations, multiple testing and selective inference adjustments. With the package we release the dataset of 816 observations of 166 immune cell parameters and 44 demographic variables. We hope that this resource can be used to generate hypotheses in immunology, but also be of benefit to the broader community, in education and benchmarking.
16:30 Stat methods for high-dim biology Rudradev Sengupta P10 High Performance Computing Using R for High Dimensional Surrogacy Applications in Drug Development models, data mining, applications, bioinformatics, performance, big data Julie Josse NA
Identification of genetic biomarkers is a primary data analysis task in the context of drug discovery experiments. These experiments consist of several high-dimensional datasets which contain information about a set of new drugs under development. This type of data structure introduces the challenge of multi-source data integration, which is needed in order to identify the biological pathways related to the new set of drugs under development. In order to process all the information contained in the datasets, high performance computing techniques are required. Currently available R packages for parallel computing are not optimized for this specific setting and data structure. We propose a new “master-slave” framework for parallel data analysis with R on a computer cluster. The proposed data analysis workflow is applied to a multi-source high-dimensional drug discovery dataset, and a performance comparison is made between the new framework and existing R packages for parallel computing. Different configuration settings for parallel programming in R are presented to show that the computation time, for the specific application under consideration, can be reduced by 534.62%.
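The authors' framework itself is not sketched here; for orientation only, the generic master-slave pattern it builds on can be written with base R's parallel package (synthetic data, not the drug discovery workflow):

```r
library(parallel)

# Master: start workers, export what they need, farm out tasks, collect results
cl <- makeCluster(4)

# Placeholder for a per-dataset biomarker analysis
analyse_chunk <- function(chunk) colMeans(chunk)

# 100 synthetic data chunks standing in for the high-dimensional sources
chunks <- replicate(100, matrix(rnorm(1000), ncol = 10), simplify = FALSE)

clusterExport(cl, "analyse_chunk")
results <- parLapply(cl, chunks, analyse_chunk)

stopCluster(cl)
```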
15:30 Robust methods Kasey Jones P6 rollmatch: An R Package for Rolling Entry Matching algorithms, models Adam Sparks NA
The gold standard of experimental research is the randomized controlled trial. However, many healthcare interventions are implemented without a randomized control group for practical or ethical reasons. Propensity score matching (PSM) is a popular method for approximating a randomized experiment from observational data by matching members of a treatment group to similar candidates in a control group that did not receive the intervention. However, traditional PSM is not designed for studies that enroll participants on a rolling basis, a common practice in healthcare interventions where delaying treatment may impact patient health. Rolling Entry Matching (REM) is a new matching method that addresses the rolling entry problem by selecting comparison group members who are similar to intervention members with respect to both static, unchanging characteristics (e.g., race, DOB) and dynamic characteristics that change over time (e.g., health conditions, health care use). This presentation will introduce both REM and rollmatch, an R package for performing REM to assess rolling entry interventions.
15:50 Robust methods Charles T. Gray P6 varameta': Meta-analysis of medians algorithms, models, applications, reproducibility Adam Sparks NA
Meta-analyses bring together summary statistics from multiple sources, which are reported in various ways. In this talk I will introduce the `varameta` package, which will provide an underlying (and reproducible) framework for understanding skewed meta-analysis data and reporting. The `varameta` package accompanies a couple of theoretical papers I am working on for the meta-analysis of medians. It is also designed to be an adjunct to the well-established `metafor` package. In this package I have collated the existing techniques for meta-analysing skewed data reported as medians and interquartile ranges (or ranges). The `varameta` package will also include reproducible simulation documentation (in .Rmd) of existing methods in meta-analysis, benchmarked against our proposed estimator for the standard error of the sample median. In this talk I will demonstrate the package and its web interface for clinicians, as well as how it can be used in everyday systematic reviews.
16:10 Robust methods Sevvandi Kandanaarachchi P6 Does normalizing your data affect outlier detection? algorithms, Data pre-processing Adam Sparks click here
It is common practice to normalize data before using an outlier detection method. But which method should we use to normalize the data? Does it matter? The short answer is yes, it does. The choice of normalization method may increase or decrease the effectiveness of an outlier detection method on a given dataset. In this talk we investigate this triangular relationship between datasets, normalization methods and outlier detection methods.
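A toy illustration of the effect, with an invented dataset and a simple nearest-neighbour distance as the outlier score (not the methods compared in the talk):

```r
set.seed(1)

# Two columns on very different scales, with one unusual point
x <- cbind(rnorm(100, sd = 1), rnorm(100, sd = 1000))
x[1, ] <- c(6, 0)   # outlying only in the first, small-scale column

# Simple outlier score: distance to the nearest neighbour
nn_score <- function(m) {
  d <- as.matrix(dist(m))
  diag(d) <- Inf
  apply(d, 1, min)
}

rank(-nn_score(x))[1]                      # raw data: outlier hidden
rank(-nn_score(scale(x)))[1]               # z-score normalisation
minmax <- apply(x, 2, function(col) (col - min(col)) / diff(range(col)))
rank(-nn_score(minmax))[1]                 # min-max normalisation
```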
16:30 Robust methods Priyanga Dilini Talagala P6 oddstream and stray: Anomaly Detection in Streaming Temporal Data with R algorithms, space/time, multivariate, streaming data, outlier detection Adam Sparks NA
This work introduces two R packages, oddstream and stray, for detecting anomalous series within a large collection of time series in the context of non-stationary streaming data. In `oddstream` we define an anomaly as an observation that is very unlikely given the recent distribution of a given system. This package provides a framework for early detection of anomalous behaviour within a large collection of streaming time series, including a novel approach that adapts to non-stationarity. In `stray` we define an anomaly as an observation that deviates markedly from the majority with a large distance gap. This package provides a framework to detect anomalies in high-dimensional data; the framework is then extended to identify anomalies in streaming temporal data. The proposed algorithms use time series features as inputs and approaches based on extreme value theory for the model building process. Using various synthetic and real datasets, we demonstrate the wide applicability and usefulness of our proposed frameworks. We show that the proposed algorithms work well in the presence of noisy non-stationary data within multiple classes of time series.
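A rough sketch of how the two packages are called on synthetic data, assuming the exported entry points are `stray::find_HDoutliers()` and `oddstream::find_odd_streams()`; consult the package documentation for the exact arguments:

```r
library(stray)
library(oddstream)

# stray: anomalies in high-dimensional (non-temporal) data
x <- rbind(matrix(rnorm(1000), ncol = 10),   # bulk of the data
           rep(8, 10))                       # one clearly deviating row
find_HDoutliers(x)

# oddstream: anomalous series within a collection of streaming time series
# (columns are series; a window of typical behaviour trains the model)
train <- matrix(rnorm(100 * 50), ncol = 50)
test  <- matrix(rnorm(100 * 50), ncol = 50)
test[, 1] <- test[, 1] + 10                  # one series drifts away
find_odd_streams(train, test)
```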
15:30 Reproducibility John Blischak AUD The workflowr R package: a framework for reproducible and collaborative data science reproducibility Scott Came click here
The workflowr R package helps scientists organize their research in a way that promotes effective project management, reproducibility, collaboration, and sharing of results. workflowr combines literate programming (knitr and rmarkdown) and version control (Git, via git2r) to generate a website containing time-stamped, versioned, and documented results. Any R user can quickly and easily adopt workflowr, which includes four key features: (1) workflowr automatically creates a directory structure for organizing data, code, and results; (2) workflowr uses the version control system Git to track different versions of the code and results without the user needing to understand Git syntax; (3) to support reproducibility, workflowr automatically includes code version information in webpages displaying results; and (4) workflowr facilitates online web hosting (e.g. GitHub Pages) to share results. Our goal is that workflowr will make it easier for scientists to organize and communicate reproducible research results. Documentation and source code are available at https://github.com/jdblischak/workflowr.
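The four features map onto a handful of exported functions; a minimal sketch (the project name and GitHub username are placeholders):

```r
library(workflowr)

# One-time setup: create the project skeleton and Git repository
wflow_start("myproject")

# Build the R Markdown analyses into a versioned website
wflow_build()

# Commit the source, rebuild, and commit the rendered results in one step
wflow_publish("analysis/*.Rmd", message = "Initial analyses")

# Check which analyses are up to date, then configure online hosting
wflow_status()
wflow_use_github("username")
```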
15:50 Reproducibility Peter Baker AUD Efficient data analysis and reporting: DRY workflows in R applications, reproducibility Scott Came NA
When analysing data for different projects, do you often find yourself repeating the same steps? Typically, these steps follow a familiar pattern of reading, cleaning, summarising, plotting and analysing data, then producing a report. To aid reproducibility, naive examples using Rmarkdown are often presented. However, I routinely employ a modular approach combining GNU Make, R, Rmarkdown and/or Sweave files tracked under git. This system helps to implement a don't-repeat-yourself (DRY) approach and scales up well as projects become more complex. To aid automation, I have developed generic R, Rmarkdown, Stata, SAS and other pattern rules for GNU Make, as well as R packages to generate a project skeleton consisting of initial directories, Makefiles, and R syntax files for basic data cleaning and summaries; move data files and documents to standard directories; use codebook information to specify factors and check data; and finally initialise and add these to a local git repository. Comparisons will be made with alternative approaches such as ProjectTemplate and drake. GNU Make pattern rules and R software are available at https://github.com/petebaker.
16:10 Reproducibility Filip Krikava AUD Automated unit test generation using genthat reproducibility, testing Scott Came click here
Your package has examples and vignettes of its overall functionality but no unit tests for individual functions. Writing those is no fun. Yet, when something goes wrong, unit tests are your best tool for quickly pinpointing errors. The genthat package can generate unit tests for you in the popular testthat format. Moreover, it can also be used to create reproductions when you find a bug in someone else's code: there, instead of generating passing test cases, it will generate the smallest, purposefully failing, one. Genthat does not magically create new tests out of the blue; instead, it extracts the smallest possible test fragments from existing code. It does that by recording the input arguments and return values of all functions called by clients of your package. The generated tests concentrate on single functions and test them independently of each other, so a failing test usually locates the error more precisely than a failing chunk of application code. Trying it out on a random set of 1500 CRAN packages, genthat managed to reproduce 80% of all function calls, increasing the unit test coverage from 19% to 54%. In this talk we present genthat and discuss testing R code.
16:30 Reproducibility Dan Wilson AUD Practical R Workflows reproducibility, workflow Scott Came click here
Learn how R can be used to create reproducible workflows for practical use in business. As analysts and data scientists we often need to repeat our work time and time again. Sometimes this will be the exact same task; other times it may be a slight variation for another client or stakeholder. This talk will demonstrate a real-world set of workflows established at The Data Collective, designed to reduce the amount of copy/paste type actions to a few function calls that get the repetitive actions out of the way, so you can focus on the important parts of your job. Find out how to overcome the challenges of a repeatable workflow and make your life easier.
15:30 Spatial data and modeling Matt Moores P7 bayesImageS: an R package for Bayesian image analysis algorithms, applications, space/time Dale Bryan-Brown NA
There are many approaches to Bayesian computation with intractable likelihoods, including the exchange algorithm, approximate Bayesian computation (ABC), thermodynamic integration, and composite likelihood. These approaches vary in accuracy as well as scalability for datasets of significant size. The Potts model is an example where such methods are required, due to its intractable normalising constant. This model is a type of Markov random field, which is commonly used for image segmentation. The dimension of its parameter space increases linearly with the number of pixels in the image, making this a challenging application for scalable Bayesian computation. My talk will introduce various algorithms in the context of the Potts model and describe their implementation in C++, using OpenMP for parallelism. I will also discuss the process of releasing this software as an open source R package on the CRAN repository.
15:50 Spatial data and modeling Jin Li P7 A new R package for spatial predictive modelling: spm models, data mining, reproducibility, space/time, performance, spatial predictive models; hybrid methods of geostatistics and machine learning; model selection and validation; predictive accuracy Dale Bryan-Brown click here
Accuracy of spatial predictions is crucial for evidence-informed environmental management and conservation. Improving the accuracy by identifying the most accurate predictive model is essential, but also challenging as the accuracy is affected by multiple factors. Recently developed hybrid methods of machine learning and geostatistics have shown their advantages in spatial predictive modelling in environmental sciences, with significantly improved predictive accuracy. An R package, ‘spm: Spatial Predictive Modelling’, has been developed to introduce these methods and has recently been released for R users. This presentation will briefly introduce spm, including: 1) spatial predictive methods, 2) new hybrid methods of geostatistical and machine learning methods, 3) assessment of predictive accuracy, 4) applications of spatial predictive models, and 5) relevant functions in spm. It will then demonstrate how to apply some functions in spm to relevant datasets and show the resultant improvements in predictive accuracy and modelling efficiency. Although spm is applied here to data from the environmental sciences, it can also be applied to data in other relevant disciplines.
16:10 Spatial data and modeling Daniel Fryer P7 rcosmo: Statistical Analysis of the Cosmic Microwave Background visualisation, databases, space/time, big data, new R package Dale Bryan-Brown NA
The Cosmic Microwave Background (CMB) is remnant electromagnetic radiation from the epoch of recombination. It is the most ancient and important source of data about the early universe and the key to unlocking the mysteries of the Big Bang and the structure of time and space. Spurred on by a wealth of satellite data, intensive investigations in the past few years have resulted in many physical and mathematical results characterising CMB radiation. It can be modelled as a realisation of a homogeneous Gaussian random field on the sphere. But what does any of this matter for statisticians if they cannot play with the CMB data in their favourite programming language? A new R package, rcosmo, provides easy access to the CMB data and various tools for exploring geometric and statistical properties of the CMB. This talk will be a quick introduction to rcosmo by one of its developers, followed by an invitation for discussion and suggestions. This research was supported under the Australian Research Council's Discovery Project DP160101366.
16:30 Spatial data and modeling Marek Rogala P7 Using deep learning on Satellite imagery to get a business edge visualisation, algorithms, models, applications, web app, Satellite data Dale Bryan-Brown NA
The talk is about new possibilities arising from applying deep learning to satellite imagery. Satellite data changes the game: it gives access to information not otherwise available to business and makes it possible to look back in time. Combined with deep learning techniques, it delivers unique insights that have never been available before. Using deep learning on satellite data can deliver insights no human can. Satellite data is huge and non-obvious. By being able to go back to an arbitrary point in history, we can prevent fraud. We can build forecasts and observe events we wouldn't have access to otherwise. We'll explore a number of emerging use cases and the common traits behind them. I will show how our R department is working with satellite data and how we use Shiny to build decision support systems for business. As an example of my previous talks, here is a link to my talk at useR! Brussels 2017: https://channel9.msdn.com/Events/useR-international-R-User-conferences/useR-International-R-User-2017-Conference/shinycollections-Google-Docs-like-live-collaboration-in-Shiny