June 27 - June 30, 2016
Stanford University, Stanford, California
Forty years of S | Rick Becker, AT&T Research Labs, NJ
Bell Labs in the 1970s was a hotbed of research in computing, statistics and many other fields. The conditions there encouraged the growth of the S language and influenced its content. The 40th anniversary of S is an appropriate time to relate a personal view of that scene and reflect on why S (and R) turned out as it did.
Rick Becker is a Lead Inventive Scientist at AT&T Labs-Research in Bedminster, NJ, specializing in statistical computing. He and colleagues created the S Language, Trellis graphics, and the AT&T Global Fraud Management System. He has used mobile call information to better understand cities, and performed vote counting and analysis for American Idol. Rick is an AT&T Fellow and a Fellow of the American Statistical Association.
Literate Programming | Donald Knuth, Stanford University
The speaker will discuss what he considers to be the most important outcome of his work developing TeX in the 1980s, namely the accidental discovery of a new approach to programming --- which caused a radical change in his own coding style. Ever since then, he has aimed to write programs for human beings (not computers) to read. The result is that the programs have fewer mistakes, they are easier to modify and maintain, and they can indeed be understood by human beings. This facilitates reproducible research, among other things.
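To make the idea concrete for this R-centric audience, here is a minimal sketch of literate programming in R Markdown terms (an assumption of this note; Knuth's own WEB and CWEB systems targeted Pascal and C, and the file name below is hypothetical). The document is written for the human reader first, with the code woven into the explanation:

    ---
    title: "A literate analysis"
    output: html_document
    ---

    We begin by loading the measurements; everything below assumes
    one row per observation.

    ```{r load}
    dat <- read.csv("measurements.csv")  # hypothetical input file
    ```

    The question is whether the response drifts over time, so we
    look at it before modeling anything.

    ```{r plot}
    plot(dat$time, dat$response, type = "l",
         xlab = "Time", ylab = "Response")
    ```

Rendering this file produces a report in which the prose, not the code, carries the argument.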
Don Knuth is Professor Emeritus of The Art of Computer Programming at Stanford University. His influence as an educator, scientist and programmer is legendary. He has written over two dozen books including the multi-volume 'The Art of Computer Programming', considered by American Scientist to be among the 100 or so books that shaped a century of science. Don is the winner of numerous awards including the ACM Turing Award, the Kyoto Prize and the National Medal of Science. He created the TeX system that revolutionized the field of technical typesetting. He invented Literate Programming, a method that aims to write programs for human beings rather than computers.
Statistical Thinking in a Data Science Course | Deborah Nolan, University of California, Berkeley
The intuition and experience needed for sound statistics practice can be hard to learn, and a course that combines computing, statistics, and working with data offers an excellent learning environment in this regard. Moreover, an integrated approach to data science creates opportunities to reinforce statistical thinking skills throughout the full data analysis cycle, from data acquisition and cleaning to data organization and analysis to communicating results. As a result, students gain the ability to reason computationally, actively engage in statistical problem solving, and learn how to keep abreast of new technologies as they evolve. This talk describes approaches and provides examples for teaching data science in this integrated fashion.
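As one concrete rendering of that cycle, here is a minimal base-R sketch of the kind of end-to-end exercise such a course might use (the file, column, and group names are hypothetical):

    # Acquisition: read raw data from a (hypothetical) file
    raw <- read.csv("survey_raw.csv", stringsAsFactors = FALSE)

    # Cleaning: coerce types and drop impossible ages
    raw$age <- suppressWarnings(as.numeric(raw$age))
    clean <- subset(raw, !is.na(age) & age >= 0 & age < 120)

    # Organization and analysis: median age by region
    by_region <- aggregate(age ~ region, data = clean, FUN = median)

    # Communication: a plot a reader can interpret at a glance
    barplot(by_region$age, names.arg = by_region$region,
            ylab = "Median age", main = "Median age by region")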
Deborah Nolan is Professor of Statistics and holds the Zaffaroni Family Chair in Undergraduate Education at the University of California, Berkeley. Her work in statistics education focuses on teaching statistical thinking in real world contexts and includes the books Stat Labs: Mathematical theory through application (with T. Speed), Teaching Statistics: A bag of tricks (with A. Gelman), and Data Science in R: A case studies approach to computational reasoning and problem solving (with D. Temple Lang).
RCloud - Collaborative Environment for Visualization and Big Data Analytics | Simon Urbanek, AT&T Labs
Analyzing Big Data in real life poses challenges with respect to performance, methodology and reusability. R is well known for its succinct syntax for analytic tasks as well as a plethora of tools for data analysis and visualization, but it is not always associated with scalability. In this talk we will present a scalable environment that allows the use of R (and other languages) in a collaborative setting, enabling sharing, reusability and reproducibility. In addition, it opens new possibilities for visualization and interactive graphics by providing seamless integration of JavaScript and R. Finally, the distributed nature of the design allows us to provide R tools for out-of-core data processing that interface with different back-ends, including Hadoop, without sacrificing the ease of use of R. We will also show a flexible framework for developing distributed models in R while reusing as much existing work as possible. As part of this talk we will illustrate the use of those tools on real data sets, including interactive visualization and distributed computing.
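RCloud itself is not sketched here, but the distributed-model pattern the abstract alludes to (fit on chunks, then combine) can be illustrated with nothing more than base R's parallel package. This is a sketch under stated assumptions: the chunk count and the simple coefficient averaging are illustrative choices, and mc.cores must be 1 on Windows:

    library(parallel)

    # Stand-in for a large data set; in the setting described, chunks
    # might instead be Hadoop blocks read via an out-of-core back-end
    set.seed(1)
    big <- data.frame(x = rnorm(1e5))
    big$y <- 2 * big$x + rnorm(1e5)
    chunks <- split(big, rep(1:8, length.out = nrow(big)))

    # Apply: fit the same model independently on each chunk in parallel
    fits <- mclapply(chunks, function(d) coef(lm(y ~ x, data = d)),
                     mc.cores = 2)

    # Combine: average the per-chunk coefficient vectors
    Reduce(`+`, fits) / length(fits)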
Simon Urbanek is a Lead Inventive Scientist at AT&T Labs in Big Data Research and a member of the R Core Development Team with interests in visualization, interactive graphics, distributed computing and Big Data analytics. He is the author of numerous R packages including Rserve, rJava, multicore, iotools, iPlots, RJDBC, Cairo and many others. Simon received his Ph.D. in Statistics at the University of Augsburg in 2004.
Towards a grammar of interactive graphics | Hadley Wickham, RStudio
I announced ggvis in 2014, but there has been little progress on it since. In this talk, I'll tell you a little bit about what I've been working on instead (data ingest, purrr, multiple models, ...) and tell you my plans for the future of ggvis. The goal is for 2016 to be the year of ggvis, and I'm going to be putting a lot of time into ggvis until it's a clear replacement for ggplot2. I'll talk about some of the new packages that will make this possible (including ggstat, ggeom, and gglayout), and how this work is also going to improve ggplot2.
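For readers who have not seen ggvis, a minimal example of the interactive grammar it already supports (using the ggvis API as released on CRAN; the slider range is arbitrary):

    library(ggvis)

    # Scatterplot with a loess smooth whose span is a live control:
    # moving the slider re-renders the plot in the browser
    mtcars %>%
      ggvis(~wt, ~mpg) %>%
      layer_points() %>%
      layer_smooths(span = input_slider(0.2, 1, value = 0.5))

The pipeline mirrors ggplot2's layered grammar, with interactive inputs taking the place of fixed parameters.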
Hadley Wickham is Chief Scientist at RStudio and a member of the R Foundation. He builds tools (computational and cognitive) to support data science. His work includes R packages for data analysis (ggvis, dplyr, tidyr), data ingest (readr, readxl, haven), and principled software development (roxygen2, testthat, devtools). He is also a writer, educator, and frequent speaker promoting more accessible, more effective and more fun data analysis.
Flexible and Interpretable Regression Using Convex Penalties | Daniela Witten, University of Washington
We consider the problem of fitting a regression model that is both flexible and interpretable. We propose two procedures for this task: the Fused Lasso Additive Model (FLAM), which is an additive model of piecewise constant fits; and Convex Regression with Interpretable Sharp Partitions (CRISP), which extends FLAM to allow for non-additivity. Both FLAM and CRISP are the solutions to convex optimization problems that can be efficiently solved. We show that FLAM and CRISP outperform competitors, such as sparse additive models (Ravikumar et al, 2009), CART (Breiman et al, 1984), and thin plate splines (Duchon, 1977), in a range of settings. We propose unbiased estimators for the degrees of freedom of FLAM and CRISP, which allow us to characterize their complexity.
This is joint work with Ashley Petersen and Noah Simon at the University of Washington.
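As a rough sketch of the kind of convex program involved, FLAM's piecewise constant additive fits can be written as a fused-lasso problem. This is reconstructed from the description above rather than quoted from the paper, and it omits the paper's additional sparsity term:

    \min_{\theta_0,\,\theta_1,\dots,\theta_p}\;
    \frac{1}{2}\sum_{i=1}^{n}\Bigl(y_i-\theta_0-\sum_{j=1}^{p}\theta_{ij}\Bigr)^{2}
    \;+\;\lambda\sum_{j=1}^{p}\sum_{i=2}^{n}
    \bigl|\theta_{\pi_j(i),\,j}-\theta_{\pi_j(i-1),\,j}\bigr|

Here \pi_j orders the observations by the j-th covariate, so the \ell_1 penalty on adjacent differences forces each fitted component to be piecewise constant in x_j, which is what makes the fit both flexible and interpretable, and the whole objective is convex and hence efficiently solvable.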
Daniela Witten's research involves the development of statistical machine learning methods for high-dimensional data, with applications to genomics and other fields. Daniela is a co-author (with Gareth James, Trevor Hastie, and Rob Tibshirani) of the very popular textbook "An Introduction to Statistical Learning". Daniela is the recipient of a number of honors, including an NDSEG Research Fellowship, an NIH Director's Early Independence Award, a Sloan Research Fellowship, and an NSF CAREER Award. Her work has been featured in the popular media: among other forums, in Forbes Magazine (three times), Elle Magazine, on KUOW radio, and as a PopTech Science Fellow. Daniela completed a BS in Math and Biology with Honors and Distinction at Stanford University in 2005, and a PhD in Statistics at Stanford University in 2010. Since 2014, Daniela has been an associate professor in Statistics and Biostatistics at the University of Washington.