Time | Session | Presenter | Venue | Title | Keywords | Chair | Slides |
---|---|---|---|---|---|---|---
9:05 | Keynote | Thomas Lin Pedersen | AUD | The Grammar of Animation | NA | Rob J Hyndman | click here |
In the world of data visualisation, much work has been put into defining a grammar for both static and interactive graphics. These efforts have often been coupled to the development of visualisation frameworks, where the grammar is reflected in the API design. Less attention has been devoted to a grammar of animation, and consequently animation frameworks have often lacked the breadth and composability that are the hallmark of grammar-driven visualisation frameworks. In this talk I will justify and present a grammar of animation and position it in relation to the grammars of graphics and interactivity, thus creating a clear division of responsibility between the three domains. I will present an R implementation of the grammar of animation, which builds on top of the ggplot2 framework and is made available as the gganimate package. Using examples with gganimate, I'll show how the proposed grammar can be used to break down, and reason about, animated data visualisation, and how it can succinctly describe very diverse animation operations. |
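As a rough illustration of how such a grammar looks in code, here is a minimal gganimate sketch (my own, not taken from the talk), assuming the released gganimate API with `transition_states()`, `enter_fade()`/`exit_fade()` and `ease_aes()`:

```r
# Minimal sketch: a transition describes how the data changes between states,
# enter/exit describe how elements appear and disappear, and ease_aes()
# controls how values are interpolated between frames.
library(ggplot2)
library(gganimate)

p <- ggplot(iris, aes(Petal.Length, Petal.Width, colour = Species)) +
  geom_point() +
  transition_states(Species, transition_length = 2, state_length = 1) +
  enter_fade() +
  exit_fade() +
  ease_aes("cubic-in-out")

animate(p)  # render the animation frame by frame
```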
10:30 | Applications in health and environment | Mark Padgham | P8 | tRansport tools for the World (Health Organization) | applications, reproducibility, community/education, space/time, big data | Paula Andrea | NA |
The World Health Organization (WHO) contracted us to provide actionable evidence for the redesign of urban transport policies to help rather than hinder human health. That means more active transport. Designing cost-effective policies to get people walking and cycling requires insight into where, when, how, and why people currently travel. This is challenging, especially in cities with limited resources, data, or analysis capabilities. We briefly describe some technical details of our 'Active Transport Toolkit' (ATT), but the primary focus will be the context that led to the WHO contract and where we plan to go next. We argue that useRs are well placed to provide openly available, global-scale, transparent tools for policy making. It was the flexibility of the R language and the supportiveness of its community - notably including rOpenSci, which hosts two of our packages - that enabled us to develop the ATT in a way that makes it flexible enough to capture cities' unique characteristics while providing a consistent user interface. The talk will conclude with an outline of lessons learned, from the perspective of others wanting to create R tools to inform policy. |
10:50 | Applications in health and environment | Philip Dyer | P8 | Models of global marine biodiversity: an exercise in mixing R markdown, parallel processing and caching on supercomputers | models, applications, reproducibility, performance, big data | Paula Andrea | click here |
R has become the standard language in ecology for statistics and modelling. If a technique has been published in mathematical ecology, it has an R package. Even the data sets have an R package! The size of data sets in ecology has been growing to the point where global analysis of ecological data can be considered. At the same time, powerful statistical techniques that rely on randomly permuting the data, such as bootstrapping, have become more popular. These are exciting times, but how do we get R to process our large data sets with computationally expensive algorithms without waiting forever for results? For those new to R, or at least new to big data in R, I have some tips, techniques and packages to help you get going. I have benefited from using R Markdown and knitr to make short transcript files. I have also made use of caching to avoid recalculating big models, and of parallel processing to calculate the models faster in the first place. |
11:10 | Applications in health and environment | Chris Hansen | P8 | Enabling Analysts: Embracing R in a National Statistics Office | Official Statistics | Paula Andrea | NA |
Stats NZ has recently adopted R as an approved analytical tool, and more recently for use in production of official outputs. Since adoption, R has had significant uptake and has been a great enabler for analysts. R is more expressive and flexible than the existing tools, allowing them to more easily solve a variety of problems. R is deployed on powerful servers, so users have a generous supply of memory and cores, meaning large datasets can be handled and long-running computations parallelised. Analysts access R using RStudio Server, and this IDE itself has had a number of positive impacts: the use of RStudio projects and R Markdown documents in particular helps analysts work in a more organised way and ensures work is reproducible. Our statistical platforms can now also use R. This is done via OpenCPU, which enables remote execution of functions via an HTTP API; that is, OpenCPU can be used to call functions in internally developed packages as web services. This has proven useful as we transition to a more service-oriented architecture. In this talk we describe the R environment at Stats NZ and its implications for analysts, and provide examples of its use in practice. |
11:30 | Applications in health and environment | Tracy Huang | P8 | Developing an Uncertainty Toolbox for Agriculture: a closer look at Sensitivity Analysis | visualisation, applications, web app, space/time, big data, R6 and Reference Classes | Paula Andrea | NA |
Digiscape is one of 8 Future Science Platforms in CSIRO focussed on delivering new analytics in the digital age to better inform agricultural systems in the face of uncertainty. The Uncertainty Toolbox is one of 15 projects within Digiscape trying to make a difference to the way models are interpreted, reported and communicated in practice for decision-making. Uncertainty is front and centre of every modelling problem, but it is sometimes difficult to quantify and challenging to communicate. The Sensitivity Analysis workflow focuses on developing a general framework for sensitivity analysis to inform the modeller about key parameters of interest and refine the model so it can be used in a robust way to make predictions and forecasts with uncertainties. We focus on methods applicable to large-scale, non-monotonic problems, developing variance-based approaches to sensitivity analysis using emulators. As such, the framework for developing this workflow in R becomes important for transparency and usability. We will outline the design steps for constructing this workflow using the latest object-oriented systems available in R and give a demonstration of the tool using Shiny. |
10:30 | Models and methods for biology and beyond | Zachary Foster | P9 | Taxa and metacoder: R packages for parsing, visualization, and manipulation of taxonomic data | visualisation, data mining, applications, databases, bioinformatics, Taxonomy | Anna Quaglieri | NA |
Modern microbiome research is producing datasets that are difficult to manipulate and visualize due to the hierarchical nature of taxonomic classifications. The “taxa” package provides a set of classes for the storage and manipulation of taxonomic data. Classes range from simple building blocks to project-level objects storing multiple user-defined datasets mapped to a taxonomy. It includes parsers that can read in taxonomic information in nearly any form. It also provides functions modeled after dplyr for manipulating a taxonomy and associated datasets such that hierarchical relationships between taxa as well as mappings between taxa and data are preserved. We hope taxa will provide a basis for an ecosystem of compatible packages. We have also developed the metacoder package for visualizing hierarchical data. Metacoder implements a novel visualization called heat trees that use the color and size of nodes and edges on a taxonomic tree to quantitatively depict up to 4 statistics. This allows for rapid exploration of data and information-dense, publication-quality graphics. This is an alternative to the stacked barcharts typically used in microbiome research. |
10:50 | Models and methods for biology and beyond | Saswati Saha | P9 | Multiple testing approaches for evaluating the effectiveness of a drug combination in a multiple-dose factorial design. | applications, multivariate, Factorial Design, Drug Combination | Anna Quaglieri | NA |
Drug combination trials are often motivated by the fact that using existing drugs in combination might prove more effective than an existing drug alone and less expensive than producing an entirely new drug. Several approaches have been explored for developing statistical methods that compare fixed (single) dose combinations to their components. However, the extension of these approaches to a multiple-dose combination clinical trial is not always simple. Considering these facts, we have proposed three approaches by which we can provide confirmatory assurance that a combination of two or more drugs is more effective than the component drugs alone. These approaches involve multiple comparisons in a multilevel factorial design where the Type I error is controlled by a Bonferroni test, a bootstrap test, and a union-intersection test in which the least favourable null configuration has been considered. We have also built an R package implementing the above approaches, and in this presentation we would like to demonstrate how this R package can be used in a drug combination trial. We will also demonstrate how these three approaches perform when benchmarked against an existing approach. |
11:10 | Models and methods for biology and beyond | Bill Lattner | P9 | Modeling Heterogeneous Treatment Effects with R | models, applications | Anna Quaglieri | click here |
Randomized experiments have become ubiquitous in many fields. Traditionally, we have focused on reporting the average treatment effect (ATE) from such experiments. With recent advances in machine learning, and the overall scale at which experiments are now conducted, we can broaden our analysis to include heterogeneous treatment effects. This provides a more nuanced view of the effect of a treatment or change on the outcome of interest. Going one step further, we can use models of heterogeneous treatment effects to optimally allocate treatment. In this talk I will provide a brief overview of heterogeneous treatment effect modeling. We will show how to apply some recently proposed methods using R, and compare the results of each using a question wording experiment from the General Social Survey. Finally, we will conclude with some practical issues in modeling heterogeneous treatment effects, including model selection and obtaining valid confidence intervals. |
11:30 | Models and methods for biology and beyond | Shian Su | P9 | Glimma: interactive graphics for gene expression analysis | visualisation, applications, bioinformatics | Anna Quaglieri | NA |
Modern RNA sequencing produces large amounts of data containing tens of thousands of genes. Exploratory and statistical analysis of these genes produces plots or tables with many data points. Glimma is a Bioconductor package that provides interactive versions of common plots from limma, a widely used gene expression analysis package. It allows researchers to explore the statistical summary of their data, with cross-chart interactions providing greater insight into the behaviours of specific genes. Interactivity allows genes of interest to be quickly interrogated on the summary graphic, which provides better context than searching through spreadsheets. Cross-chart interactions display useful additional content that would otherwise require manual querying. Glimma produces HTML pages with custom D3 JavaScript that handles interactions completely independently of R, allowing the resulting plots to be easily shared with researchers without the need for software dependencies beyond a modern browser. |
10:30 | Learning and teaching | François Michonneau | P7 | Lessons learned from developing R-based curricula across disciplines | community/education | Sam Clifford | click here |
The Carpentries is a non-profit volunteer organization that teaches scientists with little or no programming experience foundational skills in coding, data science, and best practices for reproducible research. We offer 2-day workshops for a variety of disciplines including Ecology, Genomics, Geospatial analysis, and Social Sciences. With 1300+ instructors who have taught 500+ workshops on all continents, we worked with our community of instructors to assemble evidence-based curricula using results from research on teaching and learning. We have developed detailed short- and long-term assessments to evaluate the effectiveness and level of satisfaction of our learners after attending a workshop, as well as the impact on their research and careers 6 months or more afterwards. We find that workshop participants program more often, are more confident, and use programming practices that they report make them more efficient and reproducible. Here, we will present the lessons we learned about developing curricula based on teaching R to novices across diverse disciplines, and the strategies we use to instill the desire to continue learning after attending our workshops. |
10:50 | Learning and teaching | Matthias Gehrke | P7 | Student Performance and Acceptance of Technology in a Statistics Course Based on R mosaic - Results from a Pre- and Post-Test Survey | community/education, teaching | Sam Clifford | click here |
In recent years there has been a movement towards simulation-based inference (e.g., bootstrapping and randomization tests) in order to improve students' understanding of statistical reasoning (see e.g. Chance et al. 2016). The R package mosaic was developed with a "minimal R" approach to simplify the introduction of these concepts (Pruim et al. 2017). With pre- and post-surveys we analysed whether students improved in understanding as well as in acceptance of R during a one-semester statistics course in economics-related Bachelor and Master programs. These courses were held by different lecturers at multiple locations in Germany. At our private university of applied sciences for professionals studying while working, the use of R is compulsory in all statistics courses. While conceptual understanding was evaluated by a subset of the modified CAOS inventory (as in Chance et al. (2016)), the acceptance and use of technology was assessed using an adapted version of UTAUT2 (Venkatesh et al. (2012)). |
11:10 | Learning and teaching | Mette Langaas | P7 | Teaching statistics - with all learning resources written in R Markdown | community/education, teaching | Sam Clifford | click here |
In applied courses in statistics it is important for the student to see a mix of theory, practical examples and data analyses. Being able to study the R code used to produce the data analyses, and to run and modify that code, gives the student hands-on experience, which in turn may lead to increased theoretical understanding. I will talk about my experiences with producing and using learning material written in R Markdown in two courses in statistics at the Norwegian University of Science and Technology. One course is at the master level (Generalized linear models) with few students (35) and a mix of plenary and interactive lectures. The other course is at the bachelor level (Statistical learning) with more students (70). |
10:30 | Data handling | Chester Ismay | AUD | Statistical Inference: A Tidy Approach using R | visualisation, community/education, statistical inference, tidyverse community | Jenny Bryan | click here |
How do you code up a permutation test in R? What about an ANOVA or a chi-square test? Have you ever been uncertain as to exactly which type of test you should run given the data and questions asked? The `infer` R package was created to unite common statistical inference tasks into an expressive and intuitive framework to alleviate some of these struggles and make inference more intuitive. This talk will focus on developing an understanding of the design principles of the package, which are firmly motivated by Hadley Wickham's tidy tools manifesto. It will also discuss the implementation, centered on the common conceptual threads that link a surprising range of hypothesis tests and confidence intervals. Lastly, we'll dive into some examples of how to implement the code of the `infer` package via different data sets and variable scenarios. The package aims to be useful to new students of statistics as well as seasoned practitioners. |
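For readers unfamiliar with the package, here is a small hedged sketch of the specify/hypothesize/generate/calculate workflow (my own example, not the speaker's, assuming the current infer verbs including `get_p_value()`):

```r
# Permutation test for a difference in means between two groups.
library(dplyr)  # for %>% and mutate()
library(infer)

cars <- mtcars %>% mutate(am = factor(am))

obs_stat <- cars %>%
  specify(mpg ~ am) %>%
  calculate(stat = "diff in means", order = c("1", "0"))

null_dist <- cars %>%
  specify(mpg ~ am) %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in means", order = c("1", "0"))

get_p_value(null_dist, obs_stat, direction = "two-sided")
```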
10:50 | Data handling | Thomas Lumley | AUD | Subsampling and one-step polishing for generalised linear models | algorithms, models, databases, big data | Jenny Bryan | NA |
Using only a commodity laptop it's possible to fit a generalised linear model to a dataset from about a million to a billion rows by first fitting to a subset and then doing a one-step update. The method depends on a bit of asymptotic theory, some sampling, the Fisher scoring algorithm, efficient R-database interfaces, and a little of the tidyverse. |
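A toy, in-memory illustration of the idea (my own sketch, not Lumley's code, which works against a database): fit the GLM to a subsample, then take a single Fisher scoring (IRLS) step over the full data starting from the subsample estimate.

```r
# Simulate a large logistic-regression dataset
set.seed(1)
n   <- 1e6
x   <- rnorm(n)
y   <- rbinom(n, 1, plogis(-1 + 0.5 * x))
dat <- data.frame(x, y)

# Cheap fit on a 1% subsample
sub     <- dat[sample(n, 1e4), ]
fit_sub <- glm(y ~ x, family = binomial, data = sub)

# One-step polish: a single IRLS iteration over the full data,
# started at the subsample estimate (expect a non-convergence warning)
fit_polish <- glm(y ~ x, family = binomial, data = dat,
                  start = coef(fit_sub),
                  control = glm.control(maxit = 1))
coef(fit_polish)
```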
11:10 | Data handling | James Hester | AUD | Glue strings to data in R | Package development | Jenny Bryan | NA |
String interpolation, evaluating a variable name to a value within a string, is a feature of many programming languages including Python, Julia, Javascript, Rust, and most Unix shells. R's `sprintf()` and `paste()` functions provide some of this functionality, but have limitations which make them cumbersome to use. There are also some existing add-on packages with similar functionality, however each has drawbacks. The glue package performs robust string interpolation for R. This includes evaluation of variables and arbitrary R code, with a clean and simple syntax. Because it is dependency-free, it is easy to incorporate into packages. In addition, glue provides an extensible interface to perform more complex transformations; such as `glue_sql()` to construct SQL queries with automatically quoted variables. This talk will show how to utilize glue to write beautiful code which is easy to read, write and maintain. We will also discuss ways to best use glue when performance is a concern. Finally we will create custom glue functions tailored towards specific use cases, such as JSON construction, colored messages, emoji interpolation and more. |
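A few basic glue calls by way of illustration (a simple sketch, not the talk's examples):

```r
library(glue)

name <- "useR"
year <- 2018
glue("Welcome to {name}! {year}")   # interpolate variables into a string
glue("Next year is {year + 1}")     # braces may contain arbitrary R code

# glue_sql() quotes identifiers and values safely for SQL
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")  # requires DBI + RSQLite
tbl <- "flights"
glue_sql("SELECT * FROM {`tbl`} WHERE year = {year}", .con = con)
```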
11:30 | Data handling | Max Kuhn | AUD | Data Preprocessing using Recipes | algorithms, models | Jenny Bryan | click here |
The recipes package can be used as a replacement for model.matrix as well as a general feature engineering tool. The package uses a dplyr-like syntax where a specification for a sequence of data preprocessing steps is created, with the execution of these steps deferred until later. Data processing recipes can be created sequentially and intermediate results can be cached. An example is used to illustrate the basic recipe functionality and philosophy. |
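A generic sketch of this deferred-execution workflow (not the talk's example): declare steps with `recipe()` and `step_*()`, estimate them with `prep()`, and apply them with `bake()`.

```r
library(dplyr)    # for %>%
library(recipes)

rec <- recipe(mpg ~ ., data = mtcars) %>%
  step_center(all_predictors()) %>%   # centre numeric predictors
  step_scale(all_predictors()) %>%    # then scale them
  step_pca(all_predictors())          # and extract principal components

prepped <- prep(rec, training = mtcars)       # estimate means, sds, loadings
baked   <- bake(prepped, new_data = mtcars)   # apply the steps to data
head(baked)
```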
10:30 | Statistical modeling | John Fox | P10 | New Features in the car and effects Packages | visualisation, models | Matteo Fasiolo | click here |
The widely used car and effects packages are associated with Fox and Weisberg, An R Companion to Applied Regression, the third edition of which will be published this year. In preparation, we have released the substantially revised version 3.0-0 of the car package and version 4.0-1 of the effects package. The car package focuses on tools, many of them graphical, that are useful for applied regression analysis (linear, generalized linear, mixed-effects models, etc.), including tools for preparing, examining, and transforming data prior to specification of a regression model, and tools that are useful for assessing regression models that have been fit to data. The effects package focuses on graphical methods for interpreting regression models that have been fit to data. Among the many changes and improvements to the packages are a reconceptualization of effect displays, which we call "predictor effects"; the ability to add partial residuals to effect plots of arbitrary complexity; simplification of the arguments of plotting functions; new and improved functions for summarizing and testing statistical models; and improved methods for selecting variable transformations. |
10:50 | Statistical modeling | Rainer Hirk | P10 | mvord: An R Package for Fitting Multivariate Ordinal Regression Models | algorithms, models, applications, multivariate | Matteo Fasiolo | NA |
The R package mvord implements composite likelihood estimation in the class of multivariate ordinal regression models with probit and logit links. A flexible modeling framework for multiple ordinal measurements on the same subject is set up, which takes into consideration the dependence among the multiple observations by employing different error structures. Heterogeneity in the error structure across subjects can be accounted for by the package, which allows for covariate-dependent error structures. In addition, regression coefficients and threshold parameters vary across the multiple response dimensions in the default implementation. However, constraints can be defined by the user if a reduction of the parameter space is desired. The proposed multivariate framework is illustrated by means of a credit risk application. |
11:10 | Statistical modeling | Joachim Schwarz | P10 | Partial Least Squares with formative constructs and a binary target variable | PLS, plspm package, formative constructs, binary target variable | Matteo Fasiolo | NA |
In recent years, the use of PLS has become more and more important for modelling dependencies between latent variables as an alternative to classical structural equation modelling. However, a non-metric target variable in combination with formatively measured constructs is still a particular challenge for the PLS approach. Using the plspm package (Sanchez/Trinchera/Russolillo 2017), we tested a model from the human resources management field. The main goal of this model is to examine the moderating and mediating role of meaning at work for the relationship between several social, personal, environmental and motivational job characteristics and the intention to quit as a manifest binary target variable. In coping with the complexity of the model, consisting of more than 70 latent variables, all formatively measured and many of them single-indicator constructs, there are some pitfalls in the application of the plspm package, but due to the flexibility of R it is possible to evaluate even such a complex model. |
11:30 | Statistical modeling | Murray Cameron | P10 | Exceeding the designer's expectation | algorithms, models, applications | Matteo Fasiolo | click here |
Statistical methods and their software implementations are generally designed for a particular class of applications. However, the nature of data, analysis and statisticians is that uses of the methods are envisaged that extend the application. Sometimes the reason is the nature of the data, sometimes it is a new type of model and sometimes it is the limitations of the software available. Software for regression and for generalised linear models has regularly been used in 'non-standard' ways. We will discuss some examples, considering some changepoint models in particular, and emphasise some old lessons for software developers. |
10:30 | Better data performance | David Cooley | P6 | Starting with geospatial data in Shiny, and knowing when to stop | visualisation, databases, web app, performance, spatial | David Smith | NA |
Theme: coupling R with geospatial databases to reduce the calculations and data in R and improve Shiny app speed. Like any web page, Shiny apps need to be quick and responsive for a better user experience. Doing complex calculations and storing large data objects will slow the app. Therefore, it's often desirable to remove as much of this as possible from the app. The talk will demonstrate: using MongoDB as a geospatial database; querying and returning geospatial data to R from MongoDB; comparison and benchmarking of geospatial operations in R vs on the database server; applying this to a Shiny app with a demonstration, highlighting the pros and cons; introducing the latest updates to the `googleway` package for displaying data and using Google Maps tools through R; and using Google Maps to trigger database queries and operations. |
10:50 | Better data performance | Jeffrey O. Hanson | P6 | prioritizr: Systematic conservation prioritization in R | reproducibility, space/time, performance, conservation | David Smith | NA |
Biodiversity is in crisis. To prevent further declines, protected areas need to be established in places that will achieve conservation objectives for minimal cost. However, existing decision support tools tend to offer limited customizability and can take a long time to deliver solutions. To overcome these limitations and help prioritize conservation efforts in a transparent and reproducible manner, here we present the prioritizr R package. Inspired by tidyverse principles, this R package provides a flexible interface for articulating, building and solving conservation planning problems. In contrast to existing tools, the prioritizr R package uses integer linear programming (ILP) techniques to mathematically formulate and solve conservation problems. As a consequence, the prioritizr R package can find solutions that are guaranteed to be optimal, and in record time. By finding solutions to problems that are relevant to the species, ecosystems, and economic factors in areas of interest, conservation scientists, planners, and decision makers stand a far greater chance of enhancing biodiversity. For more information, visit https://github.com/prioritizr/prioritizr. |
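A hedged sketch of the interface (using the simulated example data bundled with the package at the time of writing; details may differ from the talk):

```r
library(dplyr)       # for %>%
library(prioritizr)

data(sim_pu_raster, sim_features)   # simulated planning units and features

p <- problem(sim_pu_raster, sim_features) %>%
  add_min_set_objective() %>%       # minimise cost...
  add_relative_targets(0.1) %>%     # ...while protecting 10% of each feature
  add_binary_decisions()

s <- solve(p)   # needs an ILP solver installed, e.g. gurobi or Rsymphony
```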
11:10 | Better data performance | Remy Gavard | P6 | Using R to pre-process ultra-high-resolution mass spectrometry data of complex mixtures. | algorithms, applications | David Smith | NA |
Scientists are able to identify hundreds of thousands of components in crude oil using Fourier transform ion cyclotron resonance mass spectrometry (FTICR-MS). The statistical tools required to analyse the mass spectra struggle to keep pace with advancing instrument capabilities and increasing quantities of data. Today most ultra-high-resolution analyses of complex mixture samples are based on single, labour-intensive experiments. We present a new algorithm developed in R, named Themis, to jointly pre-process replicate measurements of a complex sample analysed using FTICR-MS. This improves consistency as a preliminary step to assigning chemical compositions, and the algorithm has a quality control criterion. Through the use of peak alignment and an adaptive mixture model-based strategy, it is possible to distinguish true peaks from noise. Themis demonstrated a more effective removal of noise-related peaks and the preservation and improvement of the chemical composition profile. Themis enabled the isolation of peaks that would have otherwise been discarded using traditional peak picking (based upon signal-to-noise ratio alone) for a single spectrum. |
11:30 | Better data performance | Joshua Bon | P6 | Semi-infinite programming in R | algorithms, models | David Smith | NA |
Semi-infinite programming (SIP) is an optimisation problem where, generally, there are a finite number of variables but an infinite number of (parametrised) constraints. We show how to optimise simple SIP problems in R, in particular SIP for shape-constrained regression. The package sipr (under development) will be presented and collaboration sought from those in attendance. |
13:00 | Keynote | Bill Venables | AUD | Adventures with R: Two stories of analyses and a new perspective on data | NA | Paul Murrell | click here |
I will discuss two recent analyses, one from psycholinguistics and the other from fisheries, that show the versatility of R to tackle the wide range of challenges facing the statistician/modeller adventurer. I will conclude with a more generic discussion of the status and role of data in our contemporary analytical disciplines and offer an alternative perspective from the current orthodoxy. |
14:00 | R in the community | Simon Jackson | P9 | R from academia to commercial business | applications, community/education, big data, industry, skill development | Rhydwn McGuire | NA |
A 2017 report by Stack Overflow showed that the use of R is greatest and growing fastest in academia. Commercial industries like tech, media, and finance, however, show the smallest usage and lowest adoption rates of the language. Yet learnings regarding the use of R and data science in academia and commercial settings complement each other. This presentation will share my experience as an R user moving from academia into commercial business: the transition from cognitive scientist at an Australian university to data scientist at one of the world's largest travel e-commerce sites, Booking.com. I'll discuss how the cutting-edge R skills used in academia can improve commercial product development. I will also identify the knowledge gaps I had moving into commercial business. This will be relevant to academics looking to move into industry, and business employers looking to hire data scientists from academia. |
14:20 | R in the community | Joseph Rickert | P9 | Connecting R to the "Good Stuff" | algorithms, models, applications, big data, interfaces | Rhydwn McGuire | NA |
In his book Extending R, John Chambers writes: "One of the attractions of R has always been the ability to compute an interesting result quickly. A key motivation for the original S remains as important now: to give easy access to the best computations for understanding data." R developers have taken the challenge implied in John's statement to heart, and have integrated R with some really "good stuff" while providing easy access that conforms to natural R workflows. Rcpp and Shiny, for example, are both spectacularly successful projects in which R developers expanded the reach of R by connecting to external resources. In this talk, I will survey the ongoing work to connect R to "good stuff" such as the CVX optimization software, the Stan Bayesian engine, Spark, Keras and TensorFlow; and provide some code examples, including using the sparklyr package to run machine learning models on Spark and the keras package to run deep learning and other models on TensorFlow. |
14:40 | R in the community | Lisa Chen | P9 | Using R to help industry clients – The benefits and Opportunities | visualisation, algorithms, models, data mining, applications, web app, reproducibility, multivariate, networks, performance, text analysis/NLP, big data | Rhydwn McGuire | NA |
Dr Lisa Chen is Chief Analytics Officer for Harmonic Analytics. She is a highly qualified and experienced data scientist, with a PhD in Statistics and a Bachelor of Science in Computer Science and Statistics. Lisa has extensive experience using R, including designing solution-based models for complex optimisation problems and analysing large-scale datasets in R. Harmonic has helped customers globally to address business challenges across sectors including agriculture, aviation, banking, energy, government, health, telecommunications and utilities. We use R in our daily project work and also help clients with data science team development and R training. We will outline how we have used R and Shiny, and the benefits realised. We will discuss our journey, data-driven approach, workflow and industry observations. We will discuss our learnings with R, e.g. observations regarding big data with R, version control and some of the pain points and work-arounds. We will share our observations on how clients are starting to adopt open source and R for their analytical work, plus the trends and opportunities. Lisa will demonstrate examples of our interactive client dashboards. |
14:00 | Community and education | Jonathan Carroll | P6 | Volunteer Vignettes; A Case-Study in Enhancing Documentation | applications, reproducibility, community/education, documentation | Kim Fitter | click here |
Vignettes: long-form documentation for a package, often a use-case, discussion, or scientific article. These are incredibly useful to both users and developers. In 2017, Julia Silge scraped CRAN and found most packages don't have one [1]. At the start of 2018, I decided to give back to the community by 'being the change I wanted to see in the world' and writing a Volunteer Vignette a month, for the entire year. Yet all the new and interesting packages I could think to write something for already had vignettes. The solution came to me in February: have the community nominate packages. I made the call via Twitter [2] and received an encouraging response. I set about writing the first Volunteer Vignette and immediately discovered bugs and other issues, all of which have led to positive discussions with the author and updates to the package. In this talk I will present my first six months of the Volunteer Vignettes Project. I will demonstrate why vignettes are an invaluable step in making a robust R package. [1] https://juliasilge.com/blog/mining-cran-description/ [2] https://twitter.com/carroll_jono/status/961139524901527552 |
14:20 | Community and education | Robin Hankin | P6 | Special and general relativity in R | visualisation, community/education, space/time | Kim Fitter | NA |
Although mostly used for statistics, R is a general purpose tool, and here I discuss how the R programming language can be used in the context of physics education. I introduce two R packages that have been used in the teaching of Einstein's theories of special and general relativity. The 'gyrogroup' package implements the Lorentz boosts for relativistic velocity addition. It provides dramatic visualization of the little-known fact that relativistic Lorentzian velocity addition is neither commutative nor associative. The 'schwarzschild' package presents visualization of black hole physics and gravitational waves. In this presentation I discuss these two packages and also the more general issue of R as a teaching tool in the context of physics. |
14:40 | Community and education | Sam Clifford | P6 | Classes without dependencies | community/education | Kim Fitter | click here |
Although important, learning statistics isn't generally why students choose to study science. To engage a cohort of first year Bachelor of Science students with diverse backgrounds and interests, we decided to design their core first year quantitative methods unit (with no math or programming prerequisites) around R. The course is designed to be practical; using RStudio and tidyverse packages rather than statistical tables, students can quickly engage in visualisation, data wrangling, writing functions, and modelling as part of a coherent workflow for scientific inquiry. In this talk, we discuss the learning and teaching principles and activities, outlining the use of blended and problem based learning to teach both the quantitative topic and the use of R, developing students' data analysis skills and confidence. We discuss how workshop activities, quizzes, problem solving tasks, and the final project (a collaborative scientific article) not only assess students' skills but prepare them for work as a professional scientist. We will discuss students' feedback on their experience in their journey from novice student to young scientist. |
14:00 | Scalable R | Le Zhang | P7 | Build scalable Shiny applications for employee attrition prediction on Azure cloud | visualisation, models, data mining, applications, web app, reproducibility, performance | Michael Lawrence | NA |
Voluntary employee attrition may negatively affect a company in various aspects. Identifying employees with an inclination to leave is therefore pivotal to saving potential loss. Data-driven techniques, assisted by a machine learning model, exhibit high accuracy in predicting employee attrition and offer company executives insightful information for decision making. The talk will cover a step-by-step tutorial on how to build a model for employee attrition prediction and deploy such an analytical solution as a Shiny-based web service on Azure cloud. R is used as the primary programming language for the development. Novel R packages such as AzureSMR and AzureDSVM, which allow data scientists and developers to programmatically operate cloud resources and seamlessly operationalize the analytics within an R session, will also be introduced in the talk. The Shiny application for the analytics, including interactive data visualization and model creation, is designed and deployed on Docker containers orchestrated by Kubernetes. Parameters of the deployment environment are carefully tuned to favor scalability of the application. |
14:20 | Scalable R | Bryan Galvin | P7 | Moving from Prototype to Production in R: A Look Inside the Machine Learning Infrastructure at Netflix | data mining, reproducibility, performance, big data, interfaces | Michael Lawrence | NA |
Machine learning helps inform decision making on just about every aspect of the business at Netflix, so it is important to empower our data scientists with tooling that makes them more effective. To accomplish this, we developed Metaflow, a platform written in Python for data scientists to develop, run, and deploy projects without getting in their way. Some key design features include: the ability to work with the R packages we all know and love with no restrictions; seamless scaling up from local development to the almost infinite resources in the cloud; automatic checkpointing of data and code with immutable snapshots created at each step of the modeling pipeline; and easy deployment with a built-in hosting service and scheduling. In this talk, I will present an overview of some of the best practices that are baked into Metaflow, focusing especially on those that can be applied effectively at organizations that are not at Netflix scale. Additionally, I will cover some of the lessons learned from using reticulate to interface R with a large Python project. |
14:40 | Scalable R | Jason Gasper | P7 | Integrating R into a production data environment: A case example of using Oracle database services and R for fisheries management in Alaska. | applications, databases, reproducibility | Michael Lawrence | NA |
Catch and economic information from fisheries off Alaska are critical for the management and conservation of marine resources. The National Marine Fisheries Service, Alaska Regional Office, uses an Oracle database to monitor and store federal fishery catch data off Alaska. Annually, the system processes over 2 million fishery catch transactions, and it currently houses over 25 years of historical fishery data. Information in the database includes details on harvested fish, estimates of bycatch, at-sea observations of discards, electronic monitoring of catch (video-derived estimates), geospatial information, and complex business rules to monitor catch allocations to ensure overfishing does not occur. Our paper provides a high-level overview of the system architecture, with a focus on our use of R for both development (e.g., simulation and testing) and production (e.g., statistical features) within our Oracle database. |
14:00 | Visualisation | Paul Murrell | P10 | The Minard Paradox | visualisation | Carson Sievert | click here |
Charles Joseph Minard's depiction of Napoleon's 1812 Russian campaign might be described as the best statistical graphic ever drawn ... by hand. Minard did not have the benefit of modern computer technology to help with his drawing; he did not have the option of importing a Google map tile; and he probably did not even consider the possibility of interactive tooltips. However, there are aspects of what Minard produced by hand that are very challenging for modern graphical software, particularly the thick bands that represent the size of Napoleon's army over time. This talk will describe the 'vwline' package for R and explore some of the interesting challenges that arise when attempting to render variable-width lines with software. |
14:20 | Visualisation | Natalia da Silva | P10 | Interactive Graphics for Visually Diagnosing Forest Classifiers in R | visualisation, data mining, web app | Carson Sievert | NA |
This paper describes structuring data and constructing plots to explore forest classification models interactively. A forest classifier is an example of an ensemble since it is produced by bagging multiple trees. The process of bagging and combining results from multiple trees produces numerous diagnostics which, with interactive graphics, can provide a lot of insight into class structure in high dimensions. Various aspects are explored in this paper, to assess model complexity, individual model contributions, variable importance and dimension reduction, and uncertainty in prediction associated with individual observations. The ideas are applied to the random forest algorithm (Breiman, 2001) and projection pursuit forest (da Silva et al., 2017), but could be more broadly applied to other bagged ensembles. Interactive graphics are built in R (R Core Team, 2016) using the ggplot2 (Wickham, 2016), plotly (Sievert et al., 2017), and shiny (Chang et al., 2015) packages. |
14:40 | Visualisation | Chun Fung (Jackson) Kwok | P10 | Rjs: Going hand in hand with Javascript | visualisation, interfaces, JavaScript | Carson Sievert | NA |
Many of the popular data visualisation packages in R, e.g. Plotly, Leaflet and DiagrammeR, are powered by JavaScript. I will demonstrate how far a little JavaScript can go towards creating animated and interactive visualisations from within R. This is done with the package, Rjs, which provides a simple interface between R and JavaScript. It allows you to seamlessly combine R modelling packages with JavaScript interactive visualisation libraries. This talk is for researchers, data analysts, and intermediate R users looking to extend their skills in interactive data visualisation. |
14:00 | Complex models and performance | Hong Ooi | P8 | SAR: a practical, rating-free hybrid recommender for large data | algorithms, models, applications, big data | Kelly O'Briant | NA |
SAR (Smart Adaptive Recommendations) is a fast, scalable, adaptive algorithm for personalised recommendations, based on user transaction history and item descriptions. From an end-user's point of view, SAR has the following benefits. First, it is relatively easy to explain to a nontechnical audience, compared to algorithms that rely on matrix factorisation. Second, it doesn't use subjective ratings, which can be unreliable given the pervasive influence of social media: a product that gets review-bombed after going viral will have meaningless ratings. Third, it takes event times into account, thus allowing recommendations to evolve with changing trends. Finally, it does well in recommending cold items, by building a regression model on item data. In this talk I'll discuss two separate implementations of SAR: a standalone one in base R, and an interface to an Azure web service. The former allows easy experimentation and evaluation, while the latter provides more options and is scalable to production-scale datasets. |
14:20 | Complex models and performance | Fang Zhou | P8 | Jumpstart Machine Learning with Pre-Trained Models | algorithms, models, reproducibility, interfaces | Kelly O'Briant | NA |
As a community, many of us are building models (statistical and machine learning) that address various scenarios. At conferences like useR!, and across many academic conferences, researchers publish papers that introduce new algorithms with implementations available on GitHub, implemented in R and Python and other frameworks. The community also makes available pre-trained models, especially deep learning models, to demonstrate or highlight the capabilities of the algorithm. To foster a healthy collaboration and for the reproducibility of key results, it is important that fellow data scientists can read about a new algorithm or approach and be able to try it out very quickly to see whether it meets their needs. While pre-trained machine learning models are available, they are often difficult to set up and evaluate. We are exploring a framework to make this process simpler by making it easy for any data scientist to investigate and evaluate pre-trained models. We will share our learnings and our proposal to enable data scientists to quickly discover pre-trained models that will support them to get from zero to hero in short order. |
14:40 | Complex models and performance | Stepan Sindelar | P8 | FastR: an alternative R language implementation | applications, performance, R implementations | Kelly O'Briant | NA |
R is a highly dynamic language that employs a unique combination of data type immutability, lazy evaluation, argument matching, a large amount of built-in functionality, and interaction with C and Fortran code. It is therefore a challenging task to develop an alternative R runtime that is both compatible with GNU R and can provide performance of R code comparable to static programming languages like C. FastR is an open source alternative R implementation that is trying to achieve this. The talk will introduce FastR and demonstrate the performance improvements it can offer, its compatibility with GNU R by being able to run unmodified popular complex CRAN packages like ggplot2 or Shiny, and FastR's unique features, for example in-process multi-threaded execution, and tools like the CPU sampler or viewing R memory dumps with VisualVM. |
15:30 | Genomics, signatures to single cells | Momeneh (Sepideh) Foroutan | P10 | Singscore: a single-sample gene-set scoring method for analysing molecular signatures | visualisation, applications, bioinformatics | Peter Hickey | NA |
Several single-sample gene-set enrichment analysis methods have been introduced to score samples against gene expression signatures, such as ssGSEA, GSVA, PLAGE and combining z-scores. Although these methods have been proposed to generate single-sample scores, they use information from all samples in a dataset to calculate scores for individual samples. This leads to unstable scores which are influenced by the sample size and composition of datasets. We have proposed singscore, a rank-based and truly single-sample scoring method implemented in the R/Bioconductor package singscore. We compare singscore to other methods and show that our approach performs as well as other methods for large datasets in terms of stability, while outperforming them in small datasets. Singscore is fast and generates easily interpretable scores. We show the application of this method in cancer biology, where the dependence between distinct molecular signatures can be investigated across samples. Singscore has potential applications in personalised medicine, as it calculates replicable scores for individual samples regardless of the sample size or composition in the data. |
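A rough sketch of the scoring workflow (object names are hypothetical; see the package vignette for a full example):

```r
library(singscore)

# expr_matrix: a genes x samples expression matrix (hypothetical here)
ranked <- rankGenes(expr_matrix)

# up_gene_ids / down_gene_ids: genes expected up- or down-regulated
# in the signature of interest (hypothetical)
scores <- simpleScore(ranked, upSet = up_gene_ids, downSet = down_gene_ids)
head(scores)   # one score (and dispersion) per sample
```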
15:50 | Genomics, signatures to single cells | Liam Crowhurst | P10 | scIVA: Single Cell Interactive Visualisation and Analysis | visualisation, data mining, web app, reproducibility, bioinformatics, big data | Peter Hickey | NA |
Technological advances enable measurements of gene expression at single-cell resolution, creating datasets for investigating biological processes in life science research. Gene expression data is commonly represented as a matrix of tens of thousands of genes and up to millions of cells, which has created a demand amongst biologists for quick visualisation and analysis. We developed scIVA, a Shiny web app that is designed to be used as an interactive visualisation tool for gene expression datasets, intended for those with little R experience and for users to gain preliminary insights into datasets for further exploration and analysis. The web app will also be available for download as a standalone R package. The web app performs various visualisations, all of which are interactive and downloadable through use of Plotly, integrated with D3 JavaScript, as graphing tools. Moreover, scIVA allows users to search for specific genes, subset by clusters and subpopulations, generate heatmaps and perform statistical analyses. The presentation will include a demonstration of the web app's key features. |
16:10 | Genomics, signatures to single cells | Sarah Williams | P10 | Celaref: Annotating single-cell RNAseq clusters by similarity to reference datasets | applications, bioinformatics | Peter Hickey | click here |
Single-cell RNA sequencing (scRNAseq) is a way of measuring gene expression of many individual cells simultaneously, and is often used on samples which contain a mix of different cell types. In an scRNAseq analysis individual cells are typically clustered to group them by cell type. After clustering, identifying what type of cell is in each cluster (e.g. neurons) usually needs domain-specific knowledge of marker genes and function. The celaref package accepts pre-computed cell-clusters and aims to suggest cell-types for each cluster via similarity to reference datasets (scRNAseq experiments or microarrays) from similar samples. Briefly, within-dataset differential expression is calculated to identify the most enriched genes for each cluster, then their rankings are examined in reference datasets. Kolmogorov–Smirnov tests are used to decide if multiple matches should be reported. Initial experiments on brain, lacrimal gland and blood PBMC samples show sensible matching between similar cell types without overreaching on dissimilar cells. Celaref will be submitted to Bioconductor and is available at https://github.com/MonashBioinformaticsPlatform/celaref |
16:30 | Genomics, signatures to single cells | Luke Zappia | P10 | clustree: a package for producing clustering trees using ggraph | visualisation, algorithms, data mining, bioinformatics | Peter Hickey | click here |
Clustering analysis is commonly used in many fields to group together similar samples. Many clustering algorithms exist, but all of them require some sort of user input to set parameters that affect the number of clusters produced. Deciding on the correct number of clusters for a given dataset is a difficult problem that can be tackled by looking at the relationships between samples at different resolutions. Here I will present clustree, an R package for producing clustering tree visualisations. These visualisations combine information from multiple clusterings with different resolutions, showing where new clusters come from and how samples change clusters as the number of clusters increases. Summarised information describing the samples in each cluster can be overlaid on the tree to give additional insight. I will also describe my experience developing clustree, particularly how I have made use of the ggraph package. The clustree package is available at https://github.com/lazappi/clustree and a preprint describing clustering trees can be read at https://www.biorxiv.org/content/early/2018/03/02/274035. |
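As a minimal, self-contained sketch (not the author's example), clusterings at several resolutions can be collected into columns sharing a prefix and passed straight to `clustree()`:

```r
library(clustree)

# k-means clusterings of the iris measurements for k = 1..4,
# stored in columns k1..k4 (the shared prefix "k")
d <- iris[, 1:4]
clusts <- data.frame(
  k1 = kmeans(d, centers = 1)$cluster,
  k2 = kmeans(d, centers = 2)$cluster,
  k3 = kmeans(d, centers = 3)$cluster,
  k4 = kmeans(d, centers = 4)$cluster
)

clustree(clusts, prefix = "k")   # draw the clustering tree
```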
15:30 | Data mining | Ilia Karmanov | P9 | Teach yourself deep-learning with R | visualisation, algorithms, models, Deep Neural Nets, CNNs, MLPs, Machine Learning | Kevin Kuo | NA |
R's concise matrix algebra and calculus functionality makes it easy to create machine learning models from scratch. Creating models from scratch is a great way to learn how they actually work. We show how R can be used to create a linear regression, an MLP and a CNN from scratch (see blog: http://blog.revolutionanalytics.com/2017/07/nnets-from-scratch.html) and thus how one may go about teaching oneself about DNNs. We believe this "hands-on" approach to learning is more effective because it exposes the user to all the "leaky abstractions" that modern frameworks hide and helps them understand what makes the models fragile. R's simple interface lets us easily "play" with the created models to understand further (potentially abstract) topics, e.g.: (i) visualise the classification boundary and thus investigate what effect the number of neurons (and layers) has; (ii) visualise different CNN filter-maps; (iii) solve a neural net deterministically through linear programming (without SGD) by working through "Proof of Theorem 1" in "Understanding deep learning requires re-thinking generalization" by Zhang 2017 (as a mirror to solving linear regression with SGD). |
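In the same spirit (my own toy example, not the speaker's), even a linear regression can be fitted "from scratch" with a few lines of base R matrix algebra and gradient descent:

```r
set.seed(42)
n <- 200
X <- cbind(1, rnorm(n))                  # intercept + one feature
beta_true <- c(2, -3)
y <- X %*% beta_true + rnorm(n, sd = 0.5)

beta <- c(0, 0)                          # initial guess
lr   <- 0.1                              # learning rate
for (i in 1:500) {
  grad <- t(X) %*% (X %*% beta - y) / n  # gradient of mean squared error
  beta <- beta - lr * grad
}

cbind(gradient_descent = beta,
      closed_form      = solve(t(X) %*% X, t(X) %*% y))
```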
15:50 | Data mining | Angus Taylor | P9 | Deep learning at scale with Azure Batch AI | algorithms, models, Deep learning | Kevin Kuo | NA |
In recent years, R users have been increasingly exploring the use of deep learning methods to solve difficult problems from computer vision to natural language processing. However, developing deep learning models is a time-consuming and compute-intensive task. To obtain good performance on many datasets, it is necessary to test many combinations of network structures and hyperparameters. In this talk, we will discuss how Microsoft Azure Batch AI can be used to perform this tuning task at scale on clusters of GPU-enabled virtual machines in the cloud. Developers create a single R script to define tests of multiple different network configurations, using the popular deep learning frameworks mxnet or Keras. We explain how to build a simple Docker image that can be deployed across multiple machines and defines the necessary installation dependencies. Batch AI will scale VM clusters as necessary to parallelize the tasks and obtain the optimal network configuration efficiently, saving hours or even days of the developer’s time. We will demonstrate the value of Batch AI with a live demo of training a deep learning model, implemented in R, on the classic MNIST computer vision dataset. |
16:10 | Data mining | Timothy Wong | P9 | Modelling Field Operation Capacity using Generalised Additive Model and Random Forest | algorithms, models, multivariate, big data | Kevin Kuo | click here |
In any customer-facing business, accurately predicting demand ahead of time is of paramount importance. Workforce capacity can then be flexibly scheduled at the local area level accordingly. In this way, we can ensure having sufficient workforce to meet volatile demand. In this case study, we focus on the gas boiler repair field operation in the UK. We have developed a prototype capacity forecasting procedure which uses a mixture of machine learning techniques to achieve its goal. Firstly, it uses a Generalised Additive Model approach to estimate the number of incoming work requests. It takes into account the non-linear effects of multiple predictor variables. The next stage uses a large random forest to estimate the expected number of appointments for each work request by feeding in various ordinal and categorical inputs. At this stage, the size of the training set is considerably large and does not fully fit in memory. In light of this, the random forest model was trained in chunks and in parallel to enhance computational performance. Once all previous steps have been completed, probabilistic inputs such as the ECMWF ensemble weather forecast are used to give a view of all predicted scenarios. |
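A generic two-stage sketch of this kind of pipeline (my own toy version with hypothetical data frames and columns, not the production code):

```r
library(mgcv)          # generalised additive models
library(randomForest)  # random forests

# Stage 1: GAM for daily incoming work requests, with smooth (non-linear)
# effects of weather and seasonality (daily_volume is hypothetical)
vol_fit <- gam(requests ~ s(temperature) + s(day_of_year, bs = "cc") + region,
               family = poisson, data = daily_volume)

# Stage 2: random forest for expected appointments per work request from
# categorical/ordinal job details (work_requests is hypothetical)
app_fit <- randomForest(appointments ~ job_type + appliance_age + region,
                        data = work_requests, ntree = 500)
```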
16:30 | Data mining | Bernd Bischl | P9 | iml: A new Package for Model-Agnostic Interpretable Machine Learning | algorithms, models, machine learning | Kevin Kuo | NA |
iml implements model-agnostic interpretability methods to explain the functional behavior and individual predictions of machine learning models. A large advantage of model-agnostic interpretability methods over model-specific ones is their flexibility, as often not one but many types of machine learning models are evaluated for solving a task. Anything that is built on top of such an interpretation, e.g., a visualization or graphical user interface, also becomes independent of the underlying model. Currently implemented are: feature importance, partial dependence plots, individual conditional expectation (ICE) plots, tree surrogates, LocalModel (Local Interpretable Model-agnostic Explanations), and Shapley values for explaining single predictions. The talk will cover the basic concepts behind model-agnostic interpretations, and demonstrate the functionality of the package through applied examples in R. Link to CRAN release: https://cran.r-project.org/web/packages/iml/index.html Link to GitHub page: https://github.com/christophM/iml |
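A small sketch of the package's R6 interface (a generic example, not from the talk): wrap any fitted model in a `Predictor` and hand it to an interpretation method:

```r
library(iml)
library(randomForest)

rf   <- randomForest(Species ~ ., data = iris)
pred <- Predictor$new(rf, data = iris[, -5], y = iris$Species)

imp <- FeatureImp$new(pred, loss = "ce")   # permutation feature importance
plot(imp)

pdp <- FeatureEffect$new(pred, feature = "Petal.Width", method = "pdp")
plot(pdp)                                  # partial dependence plot
```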
15:30 | Simulation and modeling focus on surv anal | Patrick Bachmann | P6 | Estimating individual Customer Lifetime Values with R: The CLVTools Package | models | Marie Trussart | NA |
Valuing customers is key to any firm. Customer lifetime value (CLV) is the central metric for valuing customers. It describes the long-term economic value of customers and gives managers an idea of how customers will evolve over time. To model CLVs in continuous non-contractual business settings such as retailers, probabilistic customer attrition models are the preferred choice in literature and practice. Our R package CLVTools provides an efficient and easy-to-use implementation framework for probabilistic customer attrition models. Building on the learnings of other implementations, we adopt S4 classes to allow constructing rich and rather complex models that are nevertheless easy to apply for the end user. In addition, the package includes recent model extensions, such as the option to consider contextual factors, that are not available in other packages. This talk will focus both on the theory of the underlying statistical framework and on the practical application using real-world data. |
15:50 | Simulation and modeling focus on surv anal | Sam Brilleman | P6 | simsurv: A Package for Simulating Simple or Complex Survival Data | models, simulation, survival analysis | Marie Trussart | click here |
The simsurv package allows users to simulate simple or complex survival data. Survival data refers to a variable corresponding to the time from a defined baseline until occurrence of an event of interest. Depending on the field, the analysis of survival data can be known as survival, duration, reliability, or event history analysis. It has been common to make simplifying parametric assumptions when simulating survival data, e.g. assuming survival times follow an exponential or Weibull distribution. However, such assumptions are unrealistic in many settings. The simsurv package provides additional flexibility by allowing users to simulate survival times from 2-component mixture distributions or a user-defined hazard function. The mixture distributions allow for a variety of flexible baseline hazard functions. Moreover, a user-defined hazard function can provide even greater flexibility since the cumulative hazard does not require a closed-form solution. This means it is possible to simulate survival times under complex statistical models such as those for joint longitudinal-survival data. The package is modelled on the survsim package in Stata (Crowther and Lambert, 2012, Stata J). |
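A minimal usage sketch with illustrative parameter values (not taken from the talk):

```r
library(simsurv)

# Covariate data: one row per individual, with a binary treatment indicator
covs <- data.frame(id = 1:500, trt = rbinom(500, 1, 0.5))

# Weibull baseline hazard, log hazard ratio of -0.5 for treatment,
# administrative censoring at t = 5
dat <- simsurv(
  dist    = "weibull",
  lambdas = 0.1,
  gammas  = 1.5,
  betas   = c(trt = -0.5),
  x       = covs,
  maxt    = 5
)

head(merge(covs, dat, by = "id"))
```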
16:10 | Simulation and modeling focus on surv anal | Raju Rimal | P6 | R-package for simulating linear model data (simrel) | models, applications, web app, multivariate, interfaces, Simulation | Marie Trussart | click here |
Data science is generating enormous amounts of data, and new and advanced analytical methods are constantly being developed to cope with the challenge of extracting information from such “big data”. Researchers often use simulated data to assess and document the properties of these new methods. Here we present an R package, `simrel`, which is a versatile and transparent tool for simulating linear model data with an extensive range of adjustable properties. The method is based on the concept of relevant components and a reduction of the regression model. The concept was first implemented in an earlier version of `simrel`, but only for the single-response case. In this version we introduce random rotations of latent components spanning a response space in order to obtain a multivariate response matrix Y. The properties of the linear relation between predictors and responses are defined by a small set of input parameters which allow versatile and adjustable simulations. In addition to the R package, a user-friendly shiny application with elaborate documentation and an RStudio gadget provide an easy interface to the package. |
16:30 | Simulation and modeling focus on surv anal | Andrés Villegas | P6 | StMoMo: An R Package for Stochastic Mortality Modelling | models, applications | Marie Trussart | NA |
In this talk we use the framework of generalised (non-)linear models to define the family of generalised Age-Period-Cohort stochastic mortality models which encompasses the vast majority of stochastic mortality projection models proposed to date, including the well-known Lee-Carter and Cairns-Blake-Dowd models. We also introduce the R package StMoMo which exploits the unifying framework of the generalised Age-Period-Cohort family to provide tools for fitting stochastic mortality models, assessing their goodness of fit and performing mortality projections. We illustrate some of the capabilities of the package by performing a comparison of several stochastic mortality models applied to the Australian mortality experience. The R package StMoMo is available at http://CRAN.R-project.org/package=StMoMo. |
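A rough sketch of how one member of this family (Lee-Carter) might be fitted and projected with StMoMo, assuming hypothetical death-count and exposure matrices `Dxt` and `Ext` aligned with `ages` and `years` vectors (not the Australian data used in the talk):

```r
library(StMoMo)

# Lee-Carter model under a log link (Poisson setting)
LC <- lc(link = "log")

# Fit to death counts (Dxt) and central exposures (Ext); both assumed to be
# age-by-year matrices matching the `ages` and `years` vectors
LCfit <- fit(LC, Dxt = Dxt, Ext = Ext, ages = ages, years = years)

# Inspect goodness of fit and project mortality 50 years ahead
plot(residuals(LCfit))
LCfor <- forecast(LCfit, h = 50)
plot(LCfor)
```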
15:30 | Improving performance | Helena Kotthaus | AUD | Optimizing Parallel R Programs via Dynamic Scheduling Strategies | models, performance | Earo Wang | NA |
We present scheduling strategies for optimizing the overall runtime of parallel R programs. Our proposal improves upon the existing mclapply function of the parallel package, which already offers a load balancing option that dynamically allocates tasks to worker processes. However, this mechanism has shortcomings when used on heterogeneous hardware architectures, where different CPU cores might have vastly different performance characteristics. We thus propose to enhance mclapply with a new parameter that allows mapping tasks to specific CPUs. The new affinity.list parameter, already available on the R-devel branch, allows setting a so-called CPU affinity mask that specifies on which CPU a given task is allowed to run. We demonstrate the benefits of the new mclapply version by showing how it can speed up parallel applications like parameter tuning. In this case study, we develop a regression model that guides the scheduling by estimating the runtime of a task for each processor type based on previous executions. In a series of code examples, we explain how this approach can be generalized to develop efficient scheduling strategies for parallel R programs. |
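A minimal sketch of the new argument in use (Linux only; the task-to-CPU mapping below is purely illustrative, not a tuned schedule):

```r
library(parallel)

# Two tasks of very different sizes
tasks <- list(heavy = 1e7, light = 1e4)

# Pin the heavy task to CPU 1 and the light task to CPU 2 via affinity masks
# (prescheduling disabled so each task keeps its own mask)
res <- mclapply(
  tasks,
  function(n) sum(rnorm(n)),
  mc.preschedule = FALSE,
  affinity.list  = list(1L, 2L)
)
```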
15:50 | Improving performance | Stepan Sindelar | AUD | Combining R and Python with GraalVM | applications, performance, programming languages interoperability, debugging | Earo Wang | NA |
GraalVM is a multi-language runtime that allows running and combining multiple programming languages in one process, operating on the same data without the need to copy it when crossing language boundaries. Moreover, the dynamic just-in-time compiler included in GraalVM is capable of applying optimizations across language boundaries. The languages implemented on top of GraalVM include FastR, an alternative R implementation, as well as C, Ruby, JavaScript, and the recently added GraalPython. The talk will present interesting ways in which R and Python can be combined into a polyglot application running on GraalVM, for example using an R package from Python or vice versa, and briefly explain how this interoperability works at the technical level. One of the most important parts of a language ecosystem is tooling, especially an interactive debugger. The talk will also present how one can debug multiple GraalVM languages at the same time in the Google Chrome Dev Tools, for instance stepping from R into C code. |
16:10 | Improving performance | David Smith | AUD | Speeding up computations in R with parallel programming in the cloud | models, performance, parallel programming | Earo Wang | click here |
There are many common workloads in R that are "embarrassingly parallel": group-by analyses, simulations, and grid-based computations are just a few examples. In this talk, I'll provide a review of tools for implementing embarrassingly parallel computations in R, including the built-in "parallel" package and extensions such as the "foreach" package. I'll also demonstrate how you can dramatically reduce the time for a complex computation -- optimizing hyperparameters for a predictive model with the "caret" package -- by using a cluster of parallel R sessions in the cloud. With the "doAzureParallel" package, I'll show how you can create a cluster of virtual machines running R in Azure, parallelize the problem by registering the cluster as a backend for "foreach", and shut down the cluster when the computation is complete, all with just a few lines of R code. |
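A hedged sketch of this pattern (the JSON configuration files are assumed to exist, and the toy simulation stands in for the caret tuning example):

```r
library(doAzureParallel)
library(foreach)

# Authenticate and spin up a pool of Azure VMs defined in cluster.json
setCredentials("credentials.json")
cluster <- makeCluster("cluster.json")
registerDoAzureParallel(cluster)

# An embarrassingly parallel simulation spread across the cloud workers
res <- foreach(i = 1:100, .combine = c) %dopar% {
  mean(rnorm(1e6))
}

# Shut the cluster down once the computation is complete
stopCluster(cluster)
```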
16:30 | Improving performance | Romain François | AUD | rrrow: an R front end to Apache Arrow | algorithms, performance, big data, streaming data, interfaces | Earo Wang | NA |
Apache Arrow is a cross-language development platform for in-memory data. It specifies a standardized, language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. It also provides computational libraries and zero-copy streaming messaging and interprocess communication. Languages currently supported include C, C++, Java, JavaScript, Python, and Ruby. R support is currently being implemented, and in this talk we will discuss the various challenges, and our short-, medium- and long-term vision for the connection between R and Apache Arrow. |
15:30 | Sports analytics | Robert Nguyen | P7 | Using Australian Rules Football to Broaden the Appeal of R and Statistics Among Youth and Public Without a STEM Background | visualisation, models, applications, reproducibility, community/education, interfaces | Alex Whan | NA |
Our talk explores how sports analytics can be used to encourage those without a STEM background into the application of statistics and programming in a real-world environment. Through the use of an R package (fitzRoy) for AFL data, we aim to lower the barrier to entry for data access while also increasing analytical fan engagement in the AFL. We will also talk about common issues that arise for the first-time R package builder. A key barrier to the growth of the AFL analytics community is data access, which not only stops people from having a go at writing, but also prevents current media from producing reproducible work. An R package with online lessons on creating common fan rating systems such as Elo, Pythagorean and Massey will engage people who might otherwise have put learning statistical modelling and R into their personal *this is too hard* bucket. Commonly, users are taught from a cleaned dataset and jump straight into modelling, which misses a key step: cleaning. With our package, we aim to use tangible examples of raw AFL data scraped from afltables and footywire to teach users how to clean scraped data themselves and get it into a tidy format for modelling. |
15:50 | Sports analytics | Alex Fun | P7 | Using TMB (Template Model Builder) to predict the winner of a ping pong match | algorithms, models, applications | Alex Whan | NA |
In a recent and popular stats.stackexchange post, the following question was asked: “I bet with my colleague that I will beat him in fifty consecutive ping pong games. So far I have won 15, what are my chances of winning the next 35 games?” (from https://stats.stackexchange.com/questions/329521/). To answer this question, I propose the following data generation process for the score-line in each game: the OP (original poster) is a far superior player who still wishes to make the game fun for their opponent (they are colleagues after all). This leads to a regression problem for the OP’s probability of winning a point that cannot be fitted using standard regression packages. This introductory talk will demonstrate how to use the TMB (Template Model Builder) package with an optimisation algorithm to find maximum likelihood estimates for the regression coefficients. This will show that TMB is a very useful and efficient tool that gives the practitioner a lot of flexibility in exploring novel data generation processes and objective functions. I will also briefly touch upon using C++ from R, and automatic differentiation, which is great for those who dislike multivariate calculus. |
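A hedged sketch of the general TMB workflow, using a deliberately simple binomial stand-in rather than the data generation process proposed in the talk:

```r
library(TMB)

# A tiny C++ objective: negative log-likelihood for the probability that the
# original poster wins a single point, given points won/played per game
writeLines("
#include <TMB.hpp>
template<class Type>
Type objective_function<Type>::operator() ()
{
  DATA_VECTOR(won);      // points won by the OP in each game
  DATA_VECTOR(played);   // points played in each game
  PARAMETER(logit_p);    // log-odds of the OP winning a point
  Type p = Type(1) / (Type(1) + exp(-logit_p));
  return -sum(dbinom(won, played, p, true));
}
", "pingpong.cpp")

compile("pingpong.cpp")
dyn.load(dynlib("pingpong"))

obj <- MakeADFun(
  data       = list(won = c(11, 11, 9), played = c(18, 20, 21)),
  parameters = list(logit_p = 0),
  DLL        = "pingpong"
)

# Maximum likelihood estimation using TMB's automatically differentiated gradient
fit <- nlminb(obj$par, obj$fn, obj$gr)
plogis(fit$par)  # estimated probability of winning a point
```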
16:10 | Sports analytics | Andrew Simpkin | P7 | A Shiny app used to predict training load in professional sports | visualisation, algorithms, models, applications, databases, web app, multivariate, performance, streaming data | Alex Whan | NA |
We have developed a Shiny dashboard web application used in professional sports to predict player load while planning a training session. This app allows coaches to better plan, prescribe and tailor training drills in advance. The Shiny dashboard app is deployed on Shiny Server Pro and connects to an SQL database of GPS data across multiple teams and sports. Teams can plan, save, edit and delete planned sessions to and from the GPS database. Based on retrospectively collected GPS and accelerometer data, we have developed a statistical learning algorithm to cluster similar drills and predict training load. The model achieves correlations over 0.95 in out-of-sample testing, with median differences below 1% in GPS outcomes. |
16:30 | Sports analytics | Sayani Gupta | P7 | CricketData: An R package for international cricket data | visualisation, data mining, applications, web app, reproducibility | Alex Whan | NA |
The CricketData package provides convenient scraper functions for downloading data from ESPNCricinfo into tibbles. Functions are provided for obtaining data on the performance of male and female players across Test, One Day International and Twenty20 formats, and for batting, bowling and fielding. Tidyverse packages can then be used to explore, visualise and analyse the data. The package enables a user to answer simple questions such as: What is the highest number of catches taken by a wicketkeeper? What is the maximum number of catches taken by a fielder in a particular innings? How many batsmen have scored 100s in two or more consecutive matches? What is the maximum number of maiden overs bowled by a bowler in a specific innings? It also allows deeper questions to be addressed, such as: Do batsmen tend to get run out more frequently when they are about to score a century? How does a cricketer's performance change in the 12 months before they retire? When is the period of peak performance during a cricketer's career? Finally, it makes it easy to produce visual comparisons of player performance across different statistics. |
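As a hedged illustration only: the downloading function, its arguments and the column names below are assumptions about the package interface, not confirmed details from the talk.

```r
library(dplyr)

# Assumed interface and column names, for illustration only
fielding <- CricketData::fetch_cricinfo("test", "men", "fielding")

# Maximum number of catches taken by a fielder in a single record
fielding %>%
  slice_max(Caught, n = 1) %>%
  select(Player, Caught)
```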
15:30 | Leveraging web apps | Katie Sasso | P8 | Shiny meets Electron: Turn your Shiny app into a standalone desktop app in no time | applications, databases, web app, reproducibility, interfaces, Automation | Johnathan Carroll | NA |
Using Shiny in consulting can be challenging, as all deployment options involve either sending intellectual property and data to the cloud or IT involvement. When providing consulting-style services to extremely large, risk-averse enterprises, this can greatly restrict one’s ability to quickly get Shiny apps into users’ hands, as engaging IT can take months, if it is approved at all. We’ll share how the Columbus Collaboratory team overcame these barriers to rapid deployment by coupling R Portable and Electron, a framework for creating native applications with a variety of web technologies. All the tools needed to use Electron for desktop deployment of Shiny apps will be reviewed. We’ll highlight a specific example in which these technologies were used within a large enterprise to completely automate a weekly report. We’ll also share how the app used R packages such as openxlsx, shinydashboard, RODBC, and zoo to query an internal database, cleanse data, calculate key metrics, and create a downloadable Excel file for dissemination. The best part? This Shiny app was delivered to the end business user as a stand-alone executable. https://github.com/ksasso/Electron_ShinyApp_Deployment |
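A hedged sketch of the kind of report pipeline described (the DSN, table, column and file names are placeholders):

```r
library(RODBC)
library(openxlsx)
library(zoo)

# Query the internal database through an ODBC DSN
con <- odbcConnect("internal_dsn")
raw <- sqlQuery(con, "SELECT metric_date, value FROM weekly_metrics ORDER BY metric_date")
odbcClose(con)

# Cleanse and compute a rolling key metric
raw <- raw[!is.na(raw$value), ]
raw$rolling_4wk <- rollmean(raw$value, k = 4, fill = NA, align = "right")

# Write the downloadable Excel report
write.xlsx(raw, file = "weekly_report.xlsx")
```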
15:50 | Leveraging web apps | Adrian Barnett | P8 | Saving time for researchers by creating publication lists using shiny | applications, databases, web app, open access | Johnathan Carroll | click here |
Researchers are often asked by funders or employers to list their publications, but funders often have different requirements (e.g., all papers versus only those in the last five years) and researchers waste a lot of time formatting papers. To save time for researchers I made a shiny application (https://aushsi.shinyapps.io/orcid/) that takes a researcher’s ORCID ID and outputs their papers in alternative formats. It uses Crossref and PubMed (rentrez) to supplement the ORCID data. The app was included in the Australian Research Council’s instructions to applicants and has been well used, with many good suggestions for improvements. However, the ORCID data are relatively messy and papers can be in multiple formats, making it difficult to create a standardised record for each paper that can be flexibly manipulated. For example, the publication’s author data are in different fields and formats. Google Scholar publications are nicely standardised, but there are authentication issues when using shiny. I will describe how the app has developed and canvass how it could be improved, including adding the percentage of publications that are open access, or other alternative research metrics. |
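A hedged sketch of the retrieval step, using the public example ORCID ID and a placeholder DOI (the app's actual processing may differ):

```r
library(rorcid)
library(rcrossref)

# Pull a researcher's works from their ORCID record (public example ID)
w <- rorcid::works("0000-0002-1825-0097")

# Supplement a record with Crossref metadata for formatting
meta <- rcrossref::cr_works(dois = "10.5555/12345678")  # placeholder DOI
meta$data$title
```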
16:10 | Leveraging web apps | Gergely Daroczi | P8 | Managing database credentials and connections: an easy and secure approach | applications, databases, web app, interfaces, business | Johnathan Carroll | click here |
Although the `DBI` R package family already provides a standardized way of opening connections to various databases and querying data, and e.g. the `config` package allows storing default database connection parameters in a central file, with some of the sensitive fields optionally encrypted via the `keyring` or `secret` packages, there is no convenient and secure wrapper around these for the actual R end user. This talk introduces a new package that takes care of opening connections in the background to the databases specified in a secured and encrypted YAML file, so that the R user can simply specify the SQL command without needing to think about which DB backend and credentials are used. |
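A hedged sketch of the pattern such a wrapper might follow (this is not the new package's API; the `config.yml` entry names, the Postgres driver and the helper function are illustrative):

```r
library(DBI)

# Hypothetical wrapper: look up connection details for `name` in config.yml and
# fetch the password from the system keyring, so users never handle credentials
connect_to <- function(name) {
  cfg <- config::get(name)
  DBI::dbConnect(
    RPostgres::Postgres(),
    host     = cfg$host,
    dbname   = cfg$dbname,
    user     = cfg$user,
    password = keyring::key_get(service = name, username = cfg$user)
  )
}

con <- connect_to("warehouse")
dbGetQuery(con, "SELECT count(*) FROM orders")
dbDisconnect(con)
```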
16:30 | Leveraging web apps | Ian Hansel | P8 | Large Scale Data Visualisation with Deck.gl and Shiny | visualisation, web app, space/time | Johnathan Carroll | click here |
'deck.gl is a WebGL-powered framework for visual exploratory data analysis of large datasets' (https://uber.github.io/deck.gl/#/). Combining deck.gl and shiny allows for rich interactive graphics of large datasets, in particular for visualising geospatial data. We will review how to integrate deck.gl with shiny using the upcoming R package 'deck.gl'. The talk will: review the underlying technologies (WebGL, Mapbox and React.js); dive into an example exploring the latest Census from the Australian Bureau of Statistics; compare with existing visualisation capabilities in the 'rthreejs' and 'leaflet' packages; and discuss how further integrations with React.js can enable more browser-based interfaces to data and analytics. After the talk, attendees should: know how deck.gl works; understand how to visualise data in deck.gl from R using the 'deck.gl' package; and want to use deck.gl in their own work :) The talk is aimed at those with some experience in (or interest in) geospatial analysis. |