GINA: Hello, everyone. Welcome, and thank you for joining us tonight. My name is Gina Neff and I am the Executive Director of the Minderoo Centre for Technology and Democracy at the University of Cambridge. I am delighted that you're joining us tonight for the launch of the book Human-Centered Data Science: An Introduction. We have an exciting evening planned for this event. I'd like to invite our panel, our collection of fabulous people who are here this evening, to join me on screen now. As we are doing that - thank you. Before I introduce them, I want to let you know that the event is being professionally live human captioned. If you would like to have captioning you can select it in the toolbar at the bottom of the screen. There is also StreamText captioning for this event, a fully adjustable stream that will open in your browser; if you want to open it, the link has been shared in the chat for this event. A transcript will also be made available online after the event. So before we begin, a few points of housekeeping. The event is being recorded by Zoom and streamed to an audience on the platform. By attending the event you are giving your consent to this recording. The recording will be available on the Minderoo Centre for Technology and Democracy and CRASSH websites shortly after the event. Our guests tonight and I will speak for about 30 minutes and then we'll open up the discussion and take questions from the audience. You can ask those questions through the Q&A function on Zoom and I'll share them during our discussion, so please do not place questions in the chat - we are not monitoring the chat - but there is a Q&A button at the bottom of your screen and you can use that for questions. We would very much appreciate it if you could complete a short feedback questionnaire after this session; the link will be sent via Eventbrite. Please follow us on Twitter and other social media platforms. Our address is @MCTDCambridge. 
You can use that if you're live tweeting tonight's event. So we're here to talk about human-centered data science. It is a new interdisciplinary field that draws from human-computer interaction, social science, statistics and computational techniques, and tonight, ahead of the release of our book at the beginning of March, we'll talk about recommended approaches to addressing the bias and inequality that result from the automated collection, analysis and distribution of very large datasets. So our book, Human-Centered Data Science: An Introduction, offers a brief and hopefully accessible overview of many of the computational data science techniques while explaining this human-centered approach to data problems, and presenting practical guides and real-world case studies to help readers apply these methods. So what does this mean? We draw on well-established traditions of human-centered design to inform better data science practice. Human-centered data science pushes these computational approaches to large-scale data to include the kind of rich detail, contextual knowledge and deep understanding that qualitative research and mixed methods bring to the understanding of data in society. Now, my co-authors and I have helped to expand this field by writing about these practices, hosting workshops and tutorials to develop them with researchers, putting them into practice in our own projects and teaching them in our classrooms. We would argue that this deep understanding of the human and social contexts that shape data helps data scientists bring ethical thinking to the choices they face in their everyday practice. So in our approach, data science is much more than - data ethics, excuse me, is much more than simply a checklist. Human-centered approaches can help data scientists develop empathy for the subjects of their data, leading to greater acceptance and use of data science systems. 
This is important because the decisions that data scientists make and build into algorithms may have unintended individual and social harms. That's in part because data science has developed significant power over human lives, and as data scientists we have to think carefully about that power, its potential consequences and our responsibilities to do something about it. That's what we're here to discuss tonight and that's what we're trying to do in the book. So with that, allow me to introduce my co-authors. We have Cecilia Aragon, Professor in the Department of Human Centered Design and Engineering at the University of Washington; Shion Guha, Assistant Professor in the Faculty of Information at the University of Toronto; Marina Kogan, Assistant Professor in the School of Computing, University of Utah; and Michael Muller, research staff member at IBM Research. One of the things that was really important to all of us in this book was to make sure we bring in real stories from the field and how people actually do data science, so we commissioned a series of case studies, and tonight we have with us Casey Fiesler from the University of Colorado, whose case study in the book explored the ethics of scraping data. So with that, I would like to turn our attention over to Casey, and I think she will be sharing some slides. Casey?
CASEY: Thank you very much. All right, just one moment. So I was asked to contribute a small part of this book on the ethics of data scraping, which I actually think is a great topic for thinking about human-centered data science, because by the time data looks like a spreadsheet, it's very easy to lose track of the people in that data, particularly if the data was scraped in an automatic way such that you weren't necessarily engaging with the people in it. So in thinking about some examples to talk through this as an issue, one that came to mind that I know a lot of people are very familiar with was the algorithm that could determine from a photograph whether someone was gay or straight - predict from a photograph, I should say. There was a lot of conversation about this particular thing that happened, mostly around whether this research was appropriate, what it was for, what the intention was, but something that I think got lost in it was how this research was done, because it required a massive amount of data - and where do you get a dataset of people labelled with sexual orientation? From a dating site. I think this is a real example of there being real people in this data that are training algorithms, you know, doing data science, and we sometimes lose track of them even in the middle of a scandal or controversy about something else. We use data created by people that is public constantly. I mean, the number of papers that use Twitter data, for example, in this kind of way is just astronomical, and I'm not saying at all that any of this is bad, but if you're going to make ethical decisions there are a lot of different things you should keep in mind. But what I tend to hear is two things - whether it's a violation of terms of service and whether the data is public. 
So a few years ago my co-workers and I analysed a lot of data scraping provisions in terms of service, and what we found was that they don't give you a lot - they are agnostic to the kinds of things you would think would matter for an ethical decision, like what is the data, what are you going to do with it, who are you, what are the expectations of the users. None of this makes its way into terms of service. That points to two issues with using terms of service as an ethical guide. One is that it assumes that it's inherently unethical to violate terms of service, which I think is not the case, in part because it's quite easy for a platform to disallow data scraping and then no-one can study it, which one might argue is a problem - we've also seen issues with, you know, doing algorithmic audits to build up cases of discrimination, et cetera. But also, terms of service are not the only thing that can make data collection unethical, which brings me to my second one: is the data public? The number of times I've heard this is even higher than the number of papers I've seen! A good example of this is someone who released all of the data from OkCupid without any anonymisation at all, and when asked about this, the researcher said, "The data is already public". This is such a common refrain that my collaborator Michael Zimmer titled a research paper he wrote in 2010 after it: "But the data is already public". I also did some research about how people feel about their data being used, and the TL;DR here is that people have no idea that researchers are reading their tweets, and they think it has to be against the rules. I won't tell you anything about this chart except that it indicates that a lot of things other than the publicness of the data matter - lots of contextual things, including, for example, what the tweet is about. 
The idea that a tweet about what someone had for breakfast is equivalent to, and should be treated the same as, a tweet revealing a sensitive medical condition is completely absurd. So, a couple more examples of this kind of thing. In another research project, YouTube videos of people going through gender transition were used to train facial recognition. In a story about this, people indicated their surprise and alarm that they, as humans, were in this dataset that might be used to create technology that could harm their community, when they had put themselves in a vulnerable position to share not with researchers but with other people in their community whom they could help. My PhD students have also studied how people in queer communities and Black communities feel about researchers using their data, and something that came up was people not understanding the community. We talk about positionality as researchers a lot in human subjects research - how are we going to interface with interview participants - but we don't talk about it with data science. I would argue that it's even more important there, because if you're studying data without understanding that community and you're not talking to people, then positionality becomes important there as well. So I will leave you with this really important point, inspired a little bit by the very first part of Reddit's community guidelines, which is, "Remember the human". I think it's really important that even when our data just looks like numbers, we are remembering the humans in the data. Thank you.
GINA: What a great note to launch on, thank you, Casey. Next I would like to bring up Cecilia to talk a little bit about what human-centered data science is and a little context for the book.
CECILIA: Thank you, Gina, and thank you, Casey. That was a wonderful case study that really illustrates what human-centered data science is - that we need to remember the human in the data. So thank you for making that so vivid, Casey. This is why, in our book, we wanted to focus on real cases of how people are thinking about data and ethics and algorithms, because it's complicated. You can often make it very abstract and focus too much on, "Well, this is an algorithm, it has 91% effectiveness," and stop at that. What we argue in our book, and what we feel is incredibly important, is that you simply cannot do that. We cannot afford to build the technology first and then worry about the ethical and societal implications later. Instead, we need to integrate a human perspective throughout data science. So this is our goal: to reach out to everybody who is working on algorithms and data science and artificial intelligence and say, we need to think about the consequences of what we build.
I'm going to take a step back and talk about the genesis of this book. So all of us here on the screen have known each other for many years. We are all specialists in particular areas, but we all have concerns about the intersection of algorithms and data and the human. Let's see, I guess it was about seven or eight years ago - I'm a professor at the University of Washington in the eScience Institute, Gina used to be there as well - that here at the eScience Institute we got the idea that it was important to think about the societal impacts of what was then known as 'big data'. The issue that I brought up at the time was: let's not just think about the algorithms, but also think deeply about the societal context. As both Casey and Gina brought up, data is not separate, it's not clean, it is not unbiased. It lives in a world that is rich and complicated, and we need to think about this not at the end of the building of the algorithm but from the very beginning of the design. So we wrote a proposal to the Moore and Sloan Foundations, and they funded the Data Science Environments, which included people from NYU and Berkeley and across the world. And this book was born out of the work that we did in the eScience Institute and that everybody else did in their own environments. I think the concerning consequences of algorithms that reinforce social biases are becoming more and more clear in the world today - threats to our privacy, the spread of misinformation and disinformation and even, as I think we are very aware today, the weaponisation of disinformation for use in cyber-warfare, which has tremendous impacts on human lives. So what we hope to leave you with in this book is - you know, it's great to bring ethicists and social scientists into your data science company, but more than that - you need to do that, but you need to not silo them. You need to listen to them. 
Plus, all data scientists need to have some training in and understanding of what the societal consequences of their work may be. This is what we hope this book will do: be a start to teaching data scientists how to build ethical algorithms that consider the human context. This is a complex topic, but we try to make it accessible and easy to understand so that you all can change and hopefully improve the world going forward. Thank you so much.
GINA: What a great note to transition on. Next we have Shion and he is in the lucky situation of being able to teach the book in some of his classes, so, Shion?
SHION: Thank you, Gina. I will preface this by saying that we had many objectives for writing this book, but one of the main ones is to make sure that we are training the next generation of data scientists, at both the undergraduate and graduate levels, with this kind of human-centered lens, a human-centered mindset. I used to be at a different university while we were writing this book, and while the book was in process I transitioned - I was hired at the University of Toronto - because, guess what, they've been one of the first universities around to develop a programme in human-centered data science. Looking at some of the folks in the audience I can recognise some of my colleagues and students, so great to see you. But in the past years I've been at U of T I've tried to shape the curriculum and nudge it towards the human-centered lens that we really talk about in our book, and I think that from the variety of feedback that I've received from students thus far, it has been wonderful, it has been very rewarding. I see a hunger amongst the next generation of, you know, data science students who want to go out there into industry and apply a human-centered lens to the kinds of work that they do, and perhaps it would be wonderful to get some questions from some of them who are currently in this programme. But I think that we've seen early reinforcement of the idea that we did need a textbook like this, we did need, you know, a codified book where we could start training this next generation, and I think this book can provide a variety of resources to not just human-centered data science undergraduate and graduate programmes but also lots of adjacent programmes. There are a lot of institutions and universities all across the world that are opening up various kinds of professional programmes that would definitely benefit from having a book such as this. 
So I'll leave off by saying I've really enjoyed the last couple of terms at U of T where, even before this book was published, I've been able to make my students guinea pigs in trying to inculcate them with the human-centered lens. Thank you.
GINA: That is our plan - to take over the world and indoctrinate our students into better data science practices! On that note, Marina has a really interesting perspective on who this book is for and kind of the vision that we had behind it - Marina?
MARINA: Thank you, Gina. Shion just talked about how important and necessary this book is for training the next generation of data scientists and I completely agree - I see this in my class. I'm teaching a data wrangling class right now, and whenever I introduce any aspects of the human-centered perspective in this class, emphasising the humans behind the data, my students tune in so much more, they're so interested; there is such a demand for this. So this is certainly the case, and this is a big goal of this book. But I feel like the book is also necessary from the other perspective, from the social science perspective, in a sense. There are so many excellent books of critique out there about the potential harms and biases of data science and machine learning, of algorithms - the kinds of oppression they might impose on certain groups or the kinds of biases that they might perpetuate in certain societies, especially when historic biases are already encoded in the data. So that work has been done to some degree, and it is excellent work; however, most of it has been aimed at other social scientists. These are social scientists doing the work of critique and speaking to audiences of other social scientists while doing it, which of course is very important, but I think it sort of stops in its tracks in terms of reaching a wider audience. So part of the work that we do with this book is to try to translate this wonderful work of critique in accessible ways, so it can actually reach people who are practising data science or who want to practice data science, data science learners, and it doesn't sound like a bunch of social science jargon that oftentimes, maybe, feels to practitioners like calling them out or blaming them. 
But we're actually trying to help them understand this critique, and we also move beyond critique in this book, and this is what makes me most excited about our book every time I talk about it: how we are, I think, successful in combining this aspect of critique and translating it, but also moving beyond it and essentially offering concrete solutions - concrete methods or heuristics or advice about how to counter some of these biases, how to safeguard against some of these problems, how to consciously, from the beginning of your process, incorporate this human-centered perspective in what you do as a data scientist. I think that's extremely necessary and that's why I'm so excited about this book.
GINA: Thank you. Michael, you've got an interesting perspective on rallying around parts of this book?
MICHAEL: I was muted, of course! Sorry. I'd like to talk about our own responsibilities in creating and perpetuating biases in data science. Nithya Sambasivan and colleagues recently wrote a paper called "Everyone wants to do the model work, not the data work", so in our book we focus in part on the human activities that we all do in our data work, which is a necessary part of the socio-technical work of making a dataset fit for purpose. In the book we describe multiple ways humans intervene between the data and the model. This awareness leads to our chapter called Interrogating Data Science, where we remind ourselves and others that humans change the data - we do this wrangling work under time pressure with the best of intentions, and our unexamined social biases can easily enter into how we discover, capture, clean, curate and create the data, and of course we must create the data when we make ground truth. Our effort is to bring these human activities into focus so that we can examine our choices, your choices, everyone's choices, and the biases that we inevitably inject - I might even say infect - into our data. We have to focus on these human, socio-technical work practices in order to be aware of bias. We can't remove bias, but we can become aware of the biases of ourselves and others and adjust our practices accordingly. Thank you.
GINA: Thank you, Michael, and all of you. I will just add a couple of thoughts before I open up with some questions for all of us. I am deeply humbled and honoured that I got to be a part of this incredible collaboration, including collaboration with our case study authors, who are really just a great list of extremely thoughtful people bringing different perspectives to the book. I'll say a couple of things here. I would be the one coming from what Marina called the critique side, and what was great was the parallel writing - sitting next to one another, co-located in a lovely space at the University of Washington, where we locked ourselves in a room for an entire week to sit and write this book together, to get the first draft started. There were many things that surprised me, and one of the first was the completely different approach to computational problems and social science problems. Social scientists say, "What question am I trying to ask and how do I design a project?" - and I see Marina laughing, because so many times in data science projects you start with, "Here's the data, now what am I going to do with it?" One of the things we really tried in the book was not to throw that out - certainly let's not throw the baby out with the bathwater - but to say thoughtfully, "How are we asking questions of datasets?" And if we do that, we're getting closer to the core of what a human-centered design approach might be - what problem are we trying to solve in the first place? So to Casey's point earlier in the evening, you know, just because we can, just because the data are out there ready to be scraped, doesn't mean we should. Just because we might be able to ask a particular question of data doesn't mean it's actually the right question or the ethical question or the reasonable question to be asking. 
Then Cecilia brought in this wonderful story of the streetlamp effect - the man loses his keys but looks for them under the streetlamp, not across the street where he actually dropped them. He looks where the light is shining because the light is better there. Data science can often be like that, and we use that metaphor in the book to frame an arc: many of these projects can be a case of "well, we have the data, it's convenient, and so that's where we should be looking". But again, without asking those questions, is that getting us to the kinds of questions we want answered? And for me, you know - I was trained as an organisational sociologist and I really care a lot about this idea of the relationship between individuals and organisations, and that's something that I think has been missing from the conversation around data ethics and data practice. There has been a real - I don't want to say simplistic, but I'm going to say simplistic on a Zoom call with 100 people on it - there has been a real notion that if we want to make data science ethical we just teach data scientists a three-week module in their undergraduate degree and that's it, data science is now ethical! But we don't take into account that data science happens in organisations, it happens under pressures, just as Michael said, right? There is the human work of making data. Once we open that up and bring that human-centered approach, that human-centered design approach, to thinking about the problem of data science - like, what would it mean to design data science from the ground up? - then we start to see where power is, where people are, where different pressures are. And to echo Marina, I think it gets us to a very different place of putting good and ethical practices into practice. 
So we're no longer relying on an individual within a system - we understand that it's systems that we need to be thinking about, it's structures, it's power, and we need to teach people to be attuned and alert to those. But we can't simply have those critiques. Those critiques are important - the wonderful work we cite in the book, the really hard lifting around the importance of addressing biases in gender, in race, in sexual identity, addressing problems of colonial power and the legacies of colonial power, addressing different and multiple world views and how Indigenous people around the world think about sovereignty and rights and data. Bringing those perspectives together in a way that a data scientist can take on board - that was a huge challenge, and I think it's something that's really exciting about this book. The book will hopefully help us open new conversations with new people, and take conversations that have been going on in different places and bring them to new venues in new ways - that's what I'm really excited about. While I'm bringing the questions up - and I'm going to urge everybody to put your questions in the Q&A bar at the bottom of your screen - I'll start looking through those. But perhaps I can open it up to any of my co-panellists who want to pipe in: you know, what surprised you, either about the process of working on this collaboration or, in Casey's case, about working with us on the case studies? What did you find useful about having the conversation about data science in this human-centered way? What have you learned and what has been surprising?
MARINA: While others are gathering their thoughts, I might jump in. Because I have a background in both computer science and sociology, I'm lucky enough to travel between these worlds in some way - between this world of critique and this world of application, or more practical tools - and in some way I have been unaware of my own privilege with respect to this. In trying to write this book, in terms of translating the work of critique for other people who don't have the social science background, I realised how hard it is to do. Working together has been an incredible experience for me, because while I understand these social science concepts, that doesn't necessarily mean I was prepared to write about them for a more general audience, and specifically for a data science audience. So I've learned so much from my co-authors in working on this book and finding ways to make these concepts accessible, making them maybe less cryptic or less jargon-filled - getting at these really deep ideas that social science has worked on for hundreds of years and making them more available out there. So that was really hard but also really, really rewarding.
GINA: In addition to that, someone from the audience has asked: have we seen any pushback, is there any resistance to this approach? Can anybody address that?
MICHAEL: Yes, sure. I did a guest lecture at Boston University recently and someone said, "Well, the data are the data, aren't they, so why are you arguing against the data?" As if the data had a separate existence - I like to say, as if we were walking down the street one day and a perfectly formed data frame fell out of a tree onto our heads. And I think it's either Gina's or Cecilia's story that we make the data - that people who take the data as given don't understand how interpretation works, or even that the choice of what are the data is a human choice, a speech act in which we make certain things data and certain things not data. It's really hard to persuade people who think that data have a separate existence from us that they don't - we make the data.
GINA: Anybody else want to jump in on either of those surprising or pushback, where pushback came from?
CECILIA: I will jump in quickly and say I was trained as a computer scientist and statistician and I came from a very classic computer science background. So much of my own training was, "What we do is the most important thing, and all that other stuff is soft, it's weak, it's not as important." I will just admit that I started out believing that myself. So maybe the pushback came from within me, and I had to realise the error of my ways - and as soon as I did, here's what's truly amazing: rather than softening and weakening my work, it actually made it much stronger. My data science and machine learning and artificial intelligence work became so much more powerful and more effective when I started paying attention to the human-centered piece. So it is not that humans and rigour are two separate and contradictory things - it's super important to realise that. Instead, data science and computation become better and stronger and more rigorous when they are human-centered.
GINA: That is beautifully said, and we thank Anna for posing that question on pushback. We have a question from Jude Clark, who asks: given that a lot of everyday data collection and analysis is derived from informal, unmanaged situations like marketing, politics and business, how do you see your book impacting on better practice? So given that these are less regulated areas, how do we make better practice?
MARINA: Well, just to get us started, I think the point Cecilia was just making actually answers that question to some degree, because even in these areas where people maybe care more about results per se than about the methods or the ethical dimensions of the methods, at some point the efficacy of the methods - or the lack thereof - will come through: if we're not accounting for social context and we're trying to collect this political data, we might be missing a lot of nuance. We've had this happen with presidential election polling and other areas where, you know, we can't really trust the predictions anymore because we're missing some of the nuance - we're not really considering the different populations being polled, or how people respond to being polled in general: the effect of being interviewed, and how people might answer survey questions in the way they expect the interviewer wants them to. All these nuances of social context play a role in how effective these methods are, and so I think, sooner or later, people practising these methods in various kinds of industries will realise that by accounting for these things, they actually produce better products, better outcomes.
MICHAEL: I guess I will say to Jude's point - and it is a very good point - I am a big fan of regulation and regulated industries, but on the leading, bleeding edge of things it's up to us to do the ethical thing, because society will inevitably lag behind what we are doing, and it's up to us to hold up the problems and say, "Look, this can go wrong, this can go bad, you could do things you never imagined you would do to someone unless you're careful". So it's on us first, and then subsequently it's on us to lobby our legislators and regulators and help them help all of us do better.
GINA: I guess I would just add, too, that one of the things that struck us - we didn't want an ethics chapter in this book, we felt very strongly about that, and our reviewers had a very different opinion; they said, "Where is the ethics chapter? I don't see it in the table of contents." But we wanted something different - and again, I think because my co-authors are centrally in the human-centered design and human-centered engineering field, they are so attuned to this idea that you solve problems in context with a set of tools, rather than with a checklist of "this is ethical, this is unethical" - you have a framework for approaching the problem. So I would say to Jude's question, you know, one of the things we're trying to do is help people understand that there's a way of approaching data science that can be better if we remember how data are always implicated in a set of human contexts and human relationships.
MICHAEL: I would like to add quickly - the ethics chapter is the book!
GINA: Yes, thank you, reviewers who pointed that out to us on our proposal! I'm going to jump a little bit to the last question on the Q&A and forgive me, I don't have my glasses with me, so I can't actually read your name, but the question thanks us for writing the book and says looking forward to reading it, and the question is, what are the perspectives on how viewing and handling the same data might be different in different parts of the world guided by local cultural practices - how do you address the nuances of cultural difference in the book?
SHION: I can take that. So we have consistently pointed out in the book that human-centeredness is always local and very culturally contextual, and throughout the book, across a variety of chapters, we've provided practical suggestions, guides, little nudges about what you as a practitioner could potentially do, keeping in mind your local cultural practices and traditions. What that means can be very different things, right? It could mean diversity in the data science team that's handling the data; it could refer to the specific part of the world you're in, where there are different norms, perceptions and values for how data are handled; but it could also mean the relationship between what senior leaders are telling you to do and what you think you should be doing. So we've consistently tried to point out that we shouldn't really be in the business of prescribing a one-size-fits-all model, precisely for that reason. It is also impossible for us to enumerate all of the various ways that different parts of the world, or different kinds of organisations, or different communities with divergent relationships and values, could look at data. But we've made clear that human-centeredness doesn't mean there is a one-size-fits-all solution. A very good recent example of this that I love to point out, and I think I've ranted about this to my students already, is the COVID-19 pandemic. When the COVID-19 pandemic hit, everyone wanted to build global models of COVID case predictions and, guess what, all of those models were wrong. They were never useful; most of them did not predict COVID deaths or infections or cases or hospitalisations in the ways that they should have.
It is impossible to build - and I will fight with everyone about this - a global COVID-19 model for, let's say, the United States, or Canada, or India or China - you can't do it. But what can you do? Can you do it for a more locally relevant area? Could you do it for, let's say, what's happening in parts of Massachusetts? Could you do it for a healthcare system in rural Indiana? Could you do it among Indigenous populations in northern Canada? Yes, you can. You can build contextual models, and we've pointed out throughout the book that this is important. I should also point out that we've made the distinction that, by definition, human-centered data science is not model-centered data science; it's human-centered, which means we should be taking all of these relevant issues into account. Anyway, I just wanted to mention that. I hope, Ramaravind, that that answered your question, thank you.
GINA: Michael, do you want to jump in on that?
MICHAEL: Sure, I can't resist, can I? I think we did list a few - a very few. In my part of America there are 567 groups or bands of native people, not including Alaska, and in Shion's area I think it's more than 600, and we can't list all of their distinctive ethical traditions. We did make a reference to an ethical set of principles from Nunavut, and we quoted the Fourth Declaration of the Lacandon Jungle, "We want a world in which all worlds exist". But this book is not an exhaustive cultural catalogue; we couldn't do that.
GINA: I think we were also aware that we wanted the book to translate across an international audience, so we did try to pay attention to that, even though we overwhelmingly represent North America more than other regions. We have a question - and we are getting close to the end of our question time, so if you have a question, please put it into the Q&A down below. Tyler asks us to explain the reasoning behind choosing the term 'data science' rather than machine learning: it seems like we do a lot with machine learning in the book, but why use the term 'data science' instead of 'human-centered machine learning' - does data science imply something broader?
CECILIA: I'll jump in on that as someone who has done a lot of machine learning and data science: yes, data science is broader. You can't do a lot of machine learning without data science, but you can do data science without machine learning. All of these terms essentially overlap - machine learning, data science - and what they have in common is that they all use very large datasets, so we feel data science is the broadest term and the one that's most critical because, as Casey pointed out so clearly and well, data are people, right? All data is human, and we wanted to focus on that and counter the common view that, "Oh, data is scrubbed of human influence, it is unbiased". So that's why we used that term.
MARINA: I want to piggy-back on that very briefly to say that so many human decisions go into organising the data, cleaning the data and all of the other messy data-wrangling work you have to do before you can apply any machine learning model, or any other kind of model, to the data. I have been teaching this data wrangling class to the data science undergrads for the third time this semester, and we see it so clearly every term: how many human decisions go into how you are going to deal with missing values, right? Am I going to delete this entire column? Am I going to impute the values? There are so many choices an analyst can make, and it is easy to miss the fact that the data represent people. So before we even get to the modelling part of machine learning, there is a lot of work that is human-defined.
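[The choices Marina describes - deleting a column outright, dropping incomplete rows, or imputing values - can be sketched in a few lines of pandas. The dataset and column names here are hypothetical toy examples, not from the book or the class.]

```python
import numpy as np
import pandas as pd

# Hypothetical survey data with missing values
df = pd.DataFrame({
    "age": [34, np.nan, 52, 41],
    "income": [48000, 61000, np.nan, np.nan],
})

# Choice 1: delete the whole column (discards everyone's income data)
dropped = df.drop(columns=["income"])

# Choice 2: keep only complete rows (silently drops the people with gaps)
complete_cases = df.dropna()

# Choice 3: impute with the column mean (invents plausible-looking values)
imputed = df.fillna(df.mean())
```

Each option yields a different dataset, and each embeds a different assumption about the people behind the missing cells - which is exactly the kind of human-defined decision being discussed.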
GINA: That's great. We have time for one more question. Audrey Guinchard asks whether we were inspired by Helen Nissenbaum's contextual integrity framework (she demonstrated that data being public does not mean data being up for grabs, because of the inherently contextual nature of data) and, if we have been influenced by this work, could we explain how? I think all of us will have perspectives on the idea of data and context, and I see Casey nodding, too. Casey, would you like to come in on that?
CASEY: Yes, I mean I think that is an astute observation! The importance of context, and contextual integrity in particular as a framework, makes a lot of sense here. In fact, one of the small case studies I brought up was the OkCupid dataset, and Michael Zimmer, who I also mentioned, wrote a great paper for Data & Society analysing that case and how it would work for research ethics. The "remember the human in the data" idea is actually a lot about context, and that is a good word for a lot of what I've seen in this book as well.
GINA: Cecilia, do you want to jump in on context and humanness?
CECILIA: Yeah, Nissenbaum's work was really important to us, it's one of the texts I think we've all read and that influenced our thoughts as well. I know I've been using the words “contextual integrity” even before we started writing this book. When I think back to a time about 15 years ago, when I first started working in data science and was introducing human-centered data science to some of the more hardcore scientists I was working with, I brought up this term and I used these arguments to show how this impacted the data and the results. Contextual integrity is a wonderful term that is accessible to people outside the social sciences. It’s a good way to explain to scientists who need to be persuaded by pure logic. The work we are doing builds on a mountain of tremendous work that has been done by many really smart and deep-thinking people. So we do give credit to all of them in the book.
GINA: Thank you. Thank you, all, and what a note to end on. As I said, I'm really honoured that MCTD, our new centre, was able to host this event. I am so grateful to my co-authors and my co-panellist, Casey, and thank all of you for joining us tonight. The details of future events of the Minderoo Centre for Technology and Democracy can be found on our website, www.mctd.ac.uk. I would also like to mention that we are helping to host an event, Does AI Advance Gender Equality, on March 8, International Women's Day; it's based on a report we helped to co-author with the OECD, UNESCO and the Inter-American Development Bank, so that should be a really interesting event. There are also two events for the Cambridge Festival, which opens the university up to broader conversations and broader communities; those open for registration next Monday, including an event on mis- and disinformation and an interesting data walk through Cambridge for those of you who will be joining us in geographic space. There's also a possibility that we will have a splashy event at the end of March around data and data ethics - stay tuned to our website for that. Again, please continue following us on Twitter and other social media platforms at @MCTDCambridge. Thanks again, everybody. With the last two minutes that we have, if I could bring my co-panellists back in: if you have 20 seconds to say to future students, "This is what you should do, go do it," I'll start with Cecilia. I'll put you on the spot - a charge to the future.
CECILIA: The students I work with today already know the direction they want to go. We wrote the book to give them, to give all of you, tools to be more effective on the very ethical and thoughtful path my students have taught me so much about. So hopefully we, as the authors of this book, can give something back. Thank you so much.
GINA: Anyone else? A charge to the future?
SHION: If I may: in our human-centered data science graduate programme, as Cecilia pointed out, most of the next generation are actually quite cognisant of these issues. Just as a passing note, I was looking at the numbers for our admissions and I was, like, astounded to see that in barely the second year of the programme there were 700-odd applications for just a few positions, and every time I speak to a prospective or current student I'm amazed at how much consciousness they have. I was a theoretical statistician not long ago and I never learnt anything about human-centeredness; everything I've learned about it has been with my colleagues. But I feel most of the next generation is already well geared to do human-centered data science, and I'll just end with this: you should always create good trouble, wherever and whenever you are.
GINA: That is a great note to end on. We dedicate the book to our students because you all have inspired us and you continue to inspire us for the work that you do to make data science better. So on that note thank you for joining us tonight, today, wherever you are in the world, and looking forward to seeing you again, thanks so much.