By Earo Wang | January 24, 2018
Roger Peng is Professor of Biostatistics at the Johns Hopkins Bloomberg School of Public Health. His research focuses on statistical methods for environmental health problems. He is also a co-founder of the Johns Hopkins Data Science Specialization, the Simply Statistics blog, the Not So Standard Deviations podcast with Hilary Parker, and The Effort Report podcast with Elizabeth Matsui, and recipient of the 2016 Mortimer Spiegelman Award from the American Public Health Association.
Earo: You worked as a software engineer for a couple of companies during the summers while you were doing your undergraduate degree at Yale. Tell us a bit more about those work experiences.
Roger: That’s right. Yeah, a long time ago. So I worked at two companies for summer internships just as a software engineer in 1997 and 1998. It was a very different world back then, and it was before the internet boom. The first job was developing satellite communications software, and it was mostly C++ programming and building like graphical user interfaces. I knew how to program in C, but C++ was kind of new to me. I had to learn how to use like Microsoft Visual Studio, and C++. It was a lot of fun actually. And then the second job I had was at a defense military contractor and they were building software for submarine navigation. That was just C programming in a Unix environment, which I was more familiar with. But the work that I did there was mathematical. It was more like implementing mathematical algorithms. They had a huge software engineering group there, and the people there were great. I learned a lot of C programming and Unix stuff there. One of the things I learned though, from both of those experiences, is that I didn’t really wanna be a software engineer. I had great opportunities, but it didn’t feel like enough. So those two experiences made me decide that I didn’t wanna do that in the future.
Earo: So why did you decide to become an academic?
Roger: I was doing the summer internships, and I was trying to decide whether to go to graduate school or to work at a company. At that time, everyone was going to work for software companies like Microsoft. There was no Google at the time but everyone wanted to work for Microsoft. That was the best place to work and still is a good place I think. So everyone was doing software engineering, but I decided I didn’t want to do that because one thing I really missed was teaching. You really don’t have many opportunities to do teaching at a company. That’s why I decided to go to graduate school. The other thing that influenced me a lot is that my older brother is an academic too. What he was doing had a lot of influence over me. I like doing research, I like teaching, and so that’s already most of the job. It never occurred to me, at the time, to do anything else.
It’s worth saying that in statistics at that time, it wasn’t like now, where there are so many opportunities. There were other opportunities outside academia. If you were getting a PhD in statistics, there were a number of opportunities outside, mostly in pharmaceutical companies. You could work for SAS or a couple of other things. But it wasn’t like now, where just about every company needs a statistician. There’s an explosion of data science now, but it wasn’t like that then. So those opportunities didn’t really appeal to me at that time. And academia, I think, did and still does appeal to me.
Earo: You have written some papers on reproducible research. When did you start thinking about reproducibility, and what role do you think R plays in reproducible research?
Roger: I started thinking about reproducibility when I started at Johns Hopkins back in 2005, because a lot of the work that I did there was environmental health research, air pollution research. A lot of that research at the time, and even now, is very controversial because of environmental regulation, at least in the United States. So the research is constantly being challenged, which is good. It’s not like research shouldn’t be challenged. But there is an additional need for transparency when you have very high-stakes research that could have implications for national-level regulations. So I think that’s kind of the original context. We felt like if people are gonna be working in this area, then we should try to open up as much as possible, so that there’s transparency in what we’re doing, and people can have access to our software and our data, to the extent possible, as a way to allow more people to understand what we’re doing. Because this is a public policy issue. It’s not some laboratory science issue where it only affects a few people. This research could affect many millions of people. So that’s the basis on which we came to it, and we wrote a number of papers saying how this is important, especially for areas of research that have a large impact on policy, which air pollution research does.
I think that was where we started, and R at the time was the best choice. First, R itself is open source. You have the ability to examine the code and what it’s doing. Second, there were already some tools for writing reports at the time, which allowed for more reproducibility and transparency. Obviously that has improved dramatically over the last 10 years with knitr and related tools. The tooling for writing reproducible documents in R is very sophisticated and I think that problem has largely been solved. In the past, the tooling was very ad hoc. You had to put different things together, for example LaTeX and all this other stuff. But now it’s much simpler. You can use markdown in RStudio. That problem is 90% solved. So the tooling is not the issue, I think. It’s more about getting people to get into the workflow, and the habit of making things reproducible.
Earo: What’s your workflow for reproducible research?
Roger: When I get involved in a project I tend to start with some R scripts. Depending on the nature of the project, one thing I’ll often do is use an R Markdown document for data that needs to be preprocessed. If there’s very messy data, I need to do a lot of transformation and preprocessing. I’ll often do that in an R Markdown document, whereas in the past, I would have just used a script. Because in the R Markdown document, I can write little notes and make little tables and graphs to check. None of these graphs or tables are important, and they are not gonna be published or anything, but they are useful to check to make sure the data is okay.
Earo: You’re not only an academic doing research at a university, but you also write blog posts for Simply Statistics, teach online data science courses through Coursera, and host two podcasts, Not So Standard Deviations and The Effort Report. It seems you really enjoy digital media. So, any stories you’d like to share with us?
Roger: That’s a very open-ended question. (haha) All of these started with doing these online courses. My colleagues and I at Johns Hopkins had talked about putting material online. But we never quite understood what was the best way to do that. In 2013, a lot of these platforms started coming out, like Coursera, and other platforms too. They started being developed and we thought, “Okay, this is a good opportunity.” But we didn’t have much experience doing that before. We had never created videos. So I had to learn a lot about how to edit videos, how to record things, how to use cameras. And so once I started doing that I found that, first of all, it’s always fun to learn something new. I think I’ve enjoyed doing that. My colleagues didn’t enjoy it as much. They prefer to maybe have someone else do it. But I like editing the videos and doing the recordings. I found it fun. Once we had gotten good at doing that, I felt like it would be interesting to start a podcast, and I wanted to learn about audio editing and the microphones and all this equipment we have here. There also weren’t a lot of podcasts about data science or academia in the style that I like. That’s why I decided to start this podcast (NSSD). I find it fun to work with new technologies and to learn new tools, even though these new tools are not like data science tools.
Earo: But you produce data science material using those tools.
Roger: Yeah, that’s my justification I guess. I do wanna contribute to the data science community in some way, even if I’m not necessarily building R packages. But I hope that I can contribute to the community a little bit. And mostly, it’s just fun for me to pursue these other projects, because they allow me to learn new things. But eventually, maybe not right away, a lot of that learning can feed back into my academic work and my teaching too. Unfortunately, I don’t have any crazy stories.
Earo: Have you found the best camera angle for yourself when you’re doing videotaping?
Roger: Actually, one funny story is that, when I was recording the lectures for my online courses, I always pointed the camera in the same direction in my office. Because there’s not much room in my office, I always put the camera in the same place. In the background is always my bookshelf. And so every lecture video I have has my bookshelf in the background. You’d be surprised at how many comments I get about my bookshelf. People make comments about different books that I have there. They are like, “Oh, this changed.” Or, “You took this off the bookshelf.” It never occurred to me that people would notice it.
The other thing that we did find with the course is that the quality of the audio is the number one most important thing. The video itself doesn’t matter. But if the audio is not good, everyone complains about it. I had to better understand how to record good audio, because that’s way more important than getting a good video.
Earo: Could you talk more about the differences among these media, for example the different ways you approach your target audiences?
Roger: One of the things that the whole experience has allowed me to do is to experiment with different types of media. So we have these online courses using those videos, and we put those videos on YouTube. For each of the courses that we built, we have a corresponding textbook that we published through an online publisher. We self-publish it. I got some experience doing self-publishing and also the podcast. The audiences are all very different for that. What I’ve learned is that when you create something, that’s only part one of what you’re trying to do. The second part is how are you going to get that thing to the people who want it, which is extremely hard.
One of the advantages that we have with our books is that we have a built-in mechanism for doing that, which is the course. In the course we say, “Here’s the book.” And that’s a perfect alignment because the book has all the material for the course and the people who are taking the course are obviously going to be interested in it. But if you build a podcast, how are people gonna find out about your podcast? It’s not always obvious. But luckily, by the time I had built the podcast, we had the blog and these online courses, also Twitter and social media. So we had built up these channels for distribution basically. But it’s very hard to reach an audience without having the appropriate channels to do the distribution. That’s something that I had to learn from scratch. I used to think that I’ll just build something and then people will just find out about it, right? But it’s not true.
For the kinds of things that we do, it’s not like we’re going to be advertising on TV. The group of people that we’re trying to reach is very small. So we have to think a little bit about how we are going to reach out to them directly. That’s the thing that I often talk about with people who are thinking of writing a book or whatever. Usually the first thing I say to them is, “How are you gonna get that book in the hands of people who want it, or who need it?”
Earo: Not So Standard Deviations has been a fortnightly podcast since September 2015, and each episode is about one hour long. How do you and Hilary find topics to talk about?
Roger: One of my concerns has always been, for both podcasts, that we’re just gonna run out of topics. But it’s been over two years and we haven’t really run out. Every two weeks, we always seem to have something to talk about. Usually, there’s something in the news or there’s something that’s just come up that we thought about. One of the nice things that Hilary and I have done on Not So Standard Deviations is to talk through a lot of fundamental questions that we have about data analysis. What makes for good data analysis and how do you teach it? I think these are recurring topics that come up every once in a while. I’ll go to a conference or she’ll talk to somebody and she’ll come up with some new idea, and then we’ll discuss it on the podcast. I think it will probably not end for a long time unless we somehow solve this problem. But it’s been fun to develop that conversation over time. Then there are always some fun things that happen along the way. You’d be surprised at how much can happen in two weeks.
Earo: Do you feel proud when people say things like, “Roger, you’re my R teacher,” or “I’ve been reading your blog for many years,” or “I really enjoy your podcasts”?
Roger: I don’t know if “proud” is the right word. It’s very satisfying to have people say that and to see that it had some positive impact on them. That’s why we do it. We teach this every day at our institution, but we only reach a limited number of people there, people who can afford to go there and can travel to Baltimore or wherever to go to university. But now, with Coursera and all these online courses, we can reach so many more people. It’s always very gratifying to meet these people and hear their stories and how they learned, from all over. It makes me very happy to hear that. It’s always funny that sometimes people know my voice before they recognize my face, because they don’t watch the video but they listen to it. So a couple of times I’ve been at a conference, and people would say, “Oh, I heard your voice from somewhere.” Because they’ve listened to it for so many hours, so that’s always a little bit interesting.
Earo: You’ve been living in Australia for about half a year. How do you find life in Australia and what’s your favorite part?
Roger: I’d say Australia is amazing. It’s been an amazing experience for me and my family. We have loved every minute of it. It’s been really incredible. It was very easy to adapt to coming from America. The people here seem much nicer though I would say. Everything is so friendly. The attitude’s a little different here, and people seem more relaxed. Obviously, it’s a much smaller place in terms of the population, and so I think that has a big impact on it. Melbourne is amazing and the food here is amazing.
Earo: But unfortunately you’re still drinking long black.
Roger: Yeah, I can’t give up my coffee. I had to have it American style. I’ve learned a lot just from the people around Melbourne. It’s been really great for me here, and obviously I’ll definitely be sad to leave.
P.S. Roger is playing table tennis in the Monash creative ping pong room.