By Earo Wang | May 30, 2018
This is the transcript of an interview with useR! 2018 keynote speaker Steph de Silva, self-described data hazmat and grief counsellor.
Earo: Can you please introduce yourself? What’s your name, what do you do?
Steph: My name is Steph de Silva, and I’m a data science consultant. Recently, I moved down to Sydney, but in the last few years I’ve been working from rural New South Wales. I’m more of a generalist consultant rather than a specialist in any one field. My background was in econometrics, and I did my Ph.D. at the University of Sydney in time series/panel data and asymptotics. After that, I ended up at the World Bank. That was a very good learning experience for me because I basically had never worked with applied data before. It was always something that I wasn’t particularly interested in up until that point. I thought I was going to be a theoretician. Life had other plans. It turns out that I didn’t know a lot about real applied work. I got thrown in the deep end and it was great.
Earo: How did you first hear about R and when did you start using it?
Steph: Actually, I was sitting in a meeting with the World Bank. We were talking about Tonga, and there were a number of people in Tonga who wanted to be able to use their own data we were collecting. And I had a colleague turn around to me and say, “Have you heard of this thing called R?”
I thought, “This R business. Actually, I’ll have a look into this.” So I did. It turns out R was written by people like me for people like me. It just makes sense in a very natural way. I understand that’s not the case for everybody. I think that’s one of the most exciting things about the tidyverse, actually. It’s not just a collection of packages, it’s also a philosophy about access.
Earo: You initiated R-Ladies Remote. What motivated you to start this remote branch?
Steph: There’s lots of different ways a person can be remote. They can be remote geographically, like I was, sitting on a farm. A person can also be isolated in lots of different ways culturally, particularly gender minorities. I find that sometimes people have that first baby or the second baby and it becomes so much harder to go to a meetup. They may not feel connected with their communities. Professionally that can be a very difficult and very lonely place. Other people might be isolated due to the fact that they may not be neurotypical. Going to a meetup group with 50 or 100 people is a very different prospect for somebody who’s very introverted, compared to just jumping on a Slack channel.
A remote branch offers the opportunity to people who are isolated in lots of different ways to not be isolated anymore. My cofounders feel the same way. Abigail lives in the UK. She has three kids like me. She felt that R-Ladies Remote would be useful for gender minorities who can’t attend evening meetups because they’ve got caring responsibilities. Auriel lives in a small town in the USA. There is no R-Ladies local chapter or R user group there because there just aren’t enough people. R-Ladies Remote is one way of connecting people who otherwise wouldn’t have access to this kind of education and professional development. It’s support, it’s friendship, it’s not being the only one you know who does this weird thing called R. So, I think that’s very valuable across that domain.
Earo: What would you most recommend about Australia, such as places or food, to guests attending useR!?
Steph: I think I might have scared a few people with my tales of farm life, about losing snakes in my house. I feel it necessary to reassure everybody that there will be no snakes at useR! And if there are, I promise I will catch them. (Though snakes are much less dangerous than bats in my view. I once got bitten by a bat. It’s a long story.)
Having said that, Australian wildlife is amazing. It’s definitely worth coming out to have a look. Brisbane’s a wonderful city. The weather is always great. It’s got great public transport and there’s so many things to do. If you’ve got kids, bring them. It will be a great, maybe even tax-deductible, holiday. Probably. I don’t think we can make any clear assertions around international tax deductibility, to be honest.
People are very friendly down here. We all love visitors, so if an Australian invites you to dinner at their home, or offers to give you a tour - they really mean it, they’re not just being polite!
Earo: Why did you choose the World Bank to start your career after your Ph.D.?
Steph: That was a confluence of circumstances. Like a lot of young women my age, I ended up having a couple of kids during my Ph.D. Let’s just say that I probably give different advice these days. But that happened a lot at the time because there were no other options for maternity leave. One of my babies was rather sicker than we planned on, so I had taken a couple of years off to get that in order. She’s a very healthy nine-year-old who’s learning to code now, so that worked out pretty well. When I was ready to come back to work, the World Bank had an opening. I learned all about the vagaries of applied work in the field very quickly, and that was a great learning experience.
Earo: In your bio on the useR! website you say you had a crash course in real data. What do you mean?
Steph: Well, young me was not very bright. I was very focused on theoretical work where I had intended to stay. I didn’t do much traveling through the World Bank. Mostly they kept me behind a desk generating models and bar charts. But working with data coming from these situations is very, very chaotic at times. Projects never go to plan. Unlike in theoretical work where we have a certain amount of control. One time I checked in with a team in Laos about the sampling scheme I’d put together. They didn’t follow it.
I was horrified. But they explained to me that the river had flooded and they didn’t want anybody to die. They’d chosen to go to a different school than the one I selected for them. I decided that was just a different kind of random and we had to roll with that. Things don’t always go to plan when working in field situations. I had to adjust a lot of my expectations. I had to learn to speak with many different people. They were often coming to English as a second language or had not had the benefit of tertiary degrees. Yet they own that data and I had to make sure I was communicating with them in a way that they understood and that gave them ownership of their own data rather than pushing them out of it. So, a huge crash course on many fronts.
Earo: That experience wasn’t that smooth in the beginning?
Steph: It was a very good learning experience. I had to radically change my mindset about what was important. When you’re doing theoretical work, what is important is the work that you’re doing. The results that you’re creating are important. When you’re doing field work, the results may not actually be the purpose. Our purpose was to improve early grade reading education in these countries that had chosen to participate. It was about making the statistics or the data science work for the project, not the project work for the statistics. I got there eventually, but it was a big one.
Earo: Have you been to any of those countries yourself?
Steph: I wasn’t doing data collection but I did do a workshop. I was working mostly on analysis, modeling and writing up reports. In Laos, I went over there to teach a course in statistics because the in-country team was keen to up their statistics game. There I was teaching a two-day course on statistics to people who only understand Lao, through a translator who was Thai (and a genius). I had people in that class who were saying to me, “Oh, can we look at principal components? That would be great.” I’m like, “Whoa.” We also had some people in that class who’d never opened Excel before. This is an extraordinary array of people and they all felt really invested in owning their own data.
They had gone out into the field to collect this data under incredible circumstances, literally on foot in some cases. And they wanted to own it and use it. To this day, it was the best course I ever taught. One of these guys, I found out later, he took his Excel chart that he made with me, and he had to get one of the other guys to print it out in the office because he didn’t know how to use a computer that well. They printed it out for him and apparently it’s framed in his living room. And his response was, “I never thought I would be able to do a chart in Excel.” That was really something special.
Earo: You did your Ph.D. in theoretical econometrics, but now you are a data scientist. How do you see the difference between them?
Steph: Chaos. Theoretical econometrics was all about randomness, but it was a very controlled type of randomness. We define that randomness very, very well. Working out in the field, whether that’s corporate or nonprofits, it’s ill-defined chaos. There’s plenty that we cannot control. I think Jenny Bryan might have mentioned it in her interview once, being very comfortable with ambiguity. I thought that was a very profound statement. That is sometimes the value that we bring as a practical or applied data scientist out to the field. It can be about managing that ambiguity and working with people under those circumstances. I think chaos is the big difference. I never know when a project will actually start. I never know when it will actually finish. I never know who’s actually going to be working on it because those things are always changing. We have multidimensional uncertainty and just roll with it.
Earo: What’s your workflow when you’re doing a data project? Any advice you’d like to share about the workflow?
Steph: I’ve actually thought that Jenny Bryan’s work around workflow has possibly been the most influential thing to happen in data science in the last few years. I think that’s the case because it’s such a gain in terms of productivity. It’s strategic and well thought out. I’ve been using that a lot because she defines very clearly a lot of things that I’ve been thinking about in isolation and she does it better. Her work isn’t just about a single algorithm or method: it’s applicable to all of us working in this field - research, business or otherwise. That’s incredible!
In a consulting situation though, there’s a lot more to the workflow than just in the data science realm. For example, in data science, you’ll get the data. First thing I’d be wanting to do is validate that data because quite often the data they send to me is not the data they thought they sent me. And then you have exploratory analysis, visualization and so on.
But consulting goes further on either side. There’s project initiation. Often when the client comes to you, they’ve got an idea of what it is they want, but they don’t know the details yet because they’re not data scientists. That’s why they came to you. It’s about working out: what does the client need here? What is it that they want? What’s possible within their time and budget? Where are we going to cut things because we could keep going forever? Managing all of that at the beginning is a part of that consulting workflow.
Documenting those discussions so that the client has something to fall back on and knows what to expect is critical. Documenting the analytics, the outputs, and the objects at the end is also important. That’s one of the reasons why I like R. I think it’s a consulting powerhouse. You can do pretty much all of that in one place. Beautifully code-driven and version-controlled. As opposed to this hodgepodge of tools - each time you switch to a different tool, you’re opening yourself up to versioning and transfer risk.
There’s a communication workflow there as well, and that’s often a constant process back and forth with the client. The most important thing in consulting is making sure the data science is working for the business, not the business working for the data science. It’s happened to me on more than one occasion when I’ve gone to the client or gone to the team, saying, “Here’s a great model!” And they’ve said to me, “No. That is not how it works in our reality.” Not that there was anything necessarily wrong with the statistics, or the econometrics, or machine learning. It’s that it doesn’t represent their reality in the appropriate way. That’s really critical: having that domain knowledge and these constant inputs with the data owners and the project owners as well.
Earo: Every time you start a project, do you need to learn lots of new things and then get yourself to make good decisions about that project?
Steph: Absolutely. It’s about exploring that domain and often decisions about the analytics come much later. First of all, we have to make decisions about what the problems are that we’re trying to solve. What are the outputs that are going to be useful for the business or the NGO? Being very clear about that upfront and negotiating with the different stakeholders matters because often my idea of what’s critical is different to theirs. Sometimes they’ll come to me and they’ll say, “We really want the model everybody’s talking about. That’s what we’re coming to you for.”
But then you start unpacking it. I may find out that there’s a huge value in simple techniques that are not fancy. They’re not flavor of the month, but it gets them to where they need to go at one third the price. They’re really happy with that. The analysis is robust and their own analysts can be out there doing that as well. I think as a consultant, that’s an important part of my job: to be working with the analysts in-house, helping them so that they understand what’s going on, and empowering them to work with their own data. Data literacy is going to be so important going forward. People are so capable if you support them.
Earo: How do you define data literacy?
Steph: I think of data literacy as one of the parts of data science. I feel data science is made up of data agency, the ability to manipulate data at will, and data literacy, understanding what the data is telling you and the implications of how you’ve manipulated it. It’s that sweet spot between letting the data speak and understanding that data is a human artifact.
If you want to create some kind of value, data literacy matters a lot. You might be looking to reduce costs, or increase revenue, or get kids in Tonga to be reading better. You need to find an insight in the data to do that. But when you have the insight then you need to be able to explain it to someone who can use it. If it’s not explained, it can’t be used. Data literacy plays into both of those things.
Earo: How would you communicate your findings to your clients who do not have a statistical background? Do you find it difficult?
Steph: I find it’s a learned art. I think that we could do a lot more to support our upcoming data scientists to learn that art. We think about statistics as a science, but applied statistics or data science is also an art form. It’s about making decisions constantly that don’t necessarily have hard and fast rules at times.
I have a few golden rules for communication that I follow. I’m very aggressive about not using technical language. Many people that I work with understand technical language, and they will approach me and want to speak in technical terms. It’s fantastic. But if I start with technical language, I’m creating an environment that cuts a certain group of people out of the discussion. I’m letting them know this is not for you. This is for the grown-up table and you’re not at the grown-up table. That’s not fair. It also loses insights that would otherwise be really useful. So, I’m aggressive about not using technical language.
I think data visualization is key to communication and it’s been one of the critical parts of data science in the last ten years and will continue to be going forward. It’s back to the concept of ‘find and explain’. Viz will find the insights you can’t get any other way, and it’ll explain them too. Every time you create a chart or a picture, you’re abstracting away from things that are potentially keeping some of your stakeholders from understanding what you’re doing.
Earo: What do you think is a good data analysis?
Steph: I think a good data analysis is one that’s fundamentally interrogatory. It’s all about questions. What questions are we going to ask of this data set? I think you always have to start with: where did this data come from? What made it? Why is it like it is? What is it that it can actually tell us? I think those are good first steps. Beyond that, it becomes very project specific and very domain specific as well. I think a very good data analysis is one that is intimately connected with the domain and the audience it’s for.
Often the data analysis that I do initially never sees the light of day because it’s full of dead ends. Those first attempts are about unpacking it all, and it’s absolutely incomprehensible in its initial format. But that’s completely okay.
Earo: What do you think about the open data issue in the private and public sectors?
Steph: Many corporations are just understanding the fact that open source is amazing. Open data is a huge next step. Open data is a public good, but there can also be a public cost there as well. I think Chris Culnane at the University of Melbourne does some great work around privacy. He made the excellent point that open data has value, but the cost of that can be exposure. We have to find that good place between maximum value and minimal exposure. Unfortunately, the people who are more likely to be exposed are often those that can least bear the cost of it.
If we think about the Medicare data release—that was supposed to be de-identified. But Chris Culnane’s team was able to re-identify a number of different people in that dataset. Chris advocates for adversarial approaches to privacy. We need to be very clear about how we’re going to protect people.
Open data like citizen science and other scientific data sets may be quite safe to release without these complex issues attached. We have the potential there to create a lot of real value from that.
We have to decide as a society where that line is drawn - where the value outweighs the risk to these individuals. That’s part of the discussion. A lot of the companies that I’m working with are just not at that point to be having that discussion just yet. But that doesn’t mean that we throw our hands up and say, “Well, this open data business, it’s just too hard.” It’s like algorithms, and algorithmic decision making. “Too hard. I can’t do it.” That’s a very Luddite approach. We need to have eyes wide open. We have to be aware of the risks that we’re taking. We have to make those decisions knowing there are tradeoffs. No free lunches.