For any folks in the Boston area, I’ve got a morning workshop scheduled for Monday 9/12 with Third Sector New England (TSNE), Tech Networks of Boston (TNB), and TNB Labs (TNBL). The workshop is called “Hands-on Approaches to Data Storytelling.”
My background is in education, so I’m always excited when I get to run a workshop for teachers. Earlier this morning I had a chance to lead a workshop and conversation with 50 teachers from the Nord Anglia network of private schools, who have partnered with the MIT Museum and the Cambridge Science Festival to think harder about STEAM education at various age levels.
I introduced a number of the activities I run, and the DataBasic.io suite. After each one, I took a step back and asked participants to reflect on it as educators. This created some wonderful conversations about everything from building critical data thinking to the inspirations I draw from formal arts education. I look forward to chances to work with these teachers more!
Here’s a link to the slides I used.
While at Stanford to talk about “ethical data” I had a chance to read through the latest issue of the Stanford Social Innovation Review within the walls where it is published. One particular article, Using Data for Action and Impact by Jim Fruchterman, caught my eye. Jim lays out an argument for using data to streamline operations and support monitoring and evaluation within non-profit organizations. This hit one of my pet peeves, so I’m motivated to write a short response arguing for a more expansive approach to thinking about non-profits’ use of data.
This idea that data is confined to operational efficiency creates a missed opportunity for organizations working in the social good sector. When giving talks and running workshops with non-profits I often argue for three potential uses of data – improving operations, spreading the message, and bringing people together. Jim, whose work at Benetech I respect greatly, misses an opportunity here to broaden the business case to include the latter two.
Data presents non-profits with an opportunity to engage the people they serve in an empowering and capacity-building way, reinforcing their efforts towards improving conditions on whatever issue they work on. Jim’s “data supply chain” presents data as a product of the organization’s work, to be passed up the funding ladder for consumption at each level. This extractive model needs to be rethought (as Catherine D’Ignazio and I have argued). The data collected by non-profits can be used to bring the audiences they serve together to collaboratively improve their programs and outcomes. Think, for example, about Riders for Health, the organization he discusses: I wonder about the potential impacts of bringing the drivers together to analyze the data about their own routes and distances themselves, and to take ownership of the conclusions.
Skeptical that you could bring people with low data literacy together to analyze data and find a story in it? That is precisely the problem I’ve been working on with my Data Mural work. We have a process, scaffolded by many hands-on activities, that leads a collaborative group through analyzing some data to find a story they want to tell, designing a visual to tell that data-driven story, and painting it as a mural. We’ve worked with people around the world to do this, and picking the process apart has left us with a growing toolkit of activities now in wide use.
Still skeptical that you can bring people together around data in rural, uneducated settings? My colleague Anushka Shah recently shared with me the amazing work of Praxis India. They’ve brought people together in various settings to analyze data in sophisticated ways that make sense because they rely on physical mappings to represent the data.
These examples illustrate that the social good non-profits can deliver with data is not constrained to operational efficiencies. We need to highlight these types of examples to move away from a story about data and monitoring, to one about data and empowerment. In particular, thought leaders like SSIR and Jim Fruchterman should push for a broader set of examples of how data can be used in line with the social good mission of non-profits around the world.
Cross-posted to the civic.mit.edu blog.
I’m fascinated by food and data. I’ve been doing food security data murals, my Data Storytelling Studio class in 2015 focused on food security data, and I’ve been laser-cutting data onto veggies for public events.
So, not surprisingly, I was excited to see Data Cuisine coming to Boston for a workshop by Susanne Jaschko and Moritz Stefaner! Sadly I’m out of town and can’t make the workshop, but it got me thinking a bit about food, data, and creative data representation.
When doing data presentation in a creative medium, you have to choose your mappings and datasets carefully. I’m often introducing people to more creative techniques for data presentation for the first time, and I argue the strongest stories come when the message matches the medium well. For example, one of their participant projects maps tomato and basil in a dish to the number of Italian speakers. This is a culturally resonant mapping that many would understand. However, others are more abstract. One mapped people to noodles to discuss sexual habits. A stronger mapping is the project that makes a joke about “death by chocolate” by creating small caskets to tell the story of common causes of death in Belgium.
Another intriguing example is Dan Barber’s red pepper egg (featured in an episode of Netflix’s Chef’s Table show).
He worked with a farmer to breed super colorful red peppers, then fed a mash of them to chickens to create the red yolk you see above! Why? All to get a person about to eat the egg wondering how it got so red. What did the chicken eat to make that happen? Why have I never thought about the supply chain behind this egg before?
To me, Barber’s red pepper egg is a wonderful example of data representation as food. The food-chain data is beautifully captured in the red yolk, and it prompts you to ask questions directly aligned with his goals in presenting it. Wonderful!
More abstract representations of data in food feel like a missed opportunity to me. The artistic merit can be there, but they leave the viewer hungry for more. A strong mapping between the medium of your data presentation and the data and story itself is key to creating a lasting impression.
I recently gave a short talk at a Data Science event put on by Deloitte here in Boston. Here’s a short write up of my talk.
Data science and big data driven decisions are already baked into business culture across many fields. The technology and applications are far ahead of our reflections about intent, appropriateness, and responsibility. I want to focus on that last word, responsibility, which I steal from my friends in the humanitarian field. What are our responsibilities when it comes to practicing data science? Here are a few examples of why this matters, and my recommendations for what to do about it.
People Think Algorithms are Neutral
I’d be surprised if you hadn’t heard about the recent flare-up over Facebook’s trending news feed. After breaking on Gizmodo, it has been covered widely. I don’t want to debate whether this is a “responsible” example or not. I do want to focus on what it reveals about the public’s perception of data science and technology. People got upset because they assumed the feed was produced by a neutral algorithm, and a person who spoke with Gizmodo said it was biased (against conservative news outlets). The general public thinks algorithms are neutral, and this is a big problem.
Algorithms are artifacts of the cultural and social contexts of their creators and the world in which they operate. Using geographic data about population in the Boston area? Good luck separating that from the long history of redlining that created a racially segregated distribution of ownership. To be responsible we have to acknowledge and own that fact. Algorithms and data are not neutral third parties that operate outside of our world’s built-in assumptions and history.
Some Troubling Examples
Let’s flesh this out a bit more with some examples. First I look to Joy Buolamwini, a student colleague of mine in the Civic Media group at the MIT Media Lab. Joy is starting to write about “InCoding” – documenting the history of biases baked into the technologies around us, and proposing interventions to remedy them. One example is facial recognition software, which has consistently been trained on white male faces, to the point where she literally has to don a white mask to have the software recognize her. This is just the tip of the iceberg in computer science, which has a long history of leaving out entire potential populations of users.
Another example is a classic one from Latanya Sweeney at Harvard. In 2013 she discovered a racial bias trained into the operation of Google’s AdWords platform. When she searched for names more commonly given to African Americans (like her own), the system popped up ads asking if the user wanted to do background checks or look for criminal records. This is an example of an algorithm reflecting the built-in biases of the population using it, who believed that these names were more likely to be associated with criminal activity.
My third example comes from an open data release by the New York City taxi authority. They anonymized and then released a huge set of data about cab rides in the city. Some enterprising researchers realized that they had done a poor job of anonymizing the taxi medallion ids, and were able to de-anonymize the dataset. From there, Anthony Tockar was able to find strikingly juicy personal details about riders and their destinations.
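To make concrete why the taxi release failed, here’s a minimal sketch (not the researchers’ actual code, and with an invented medallion format) of the underlying flaw: when an identifier space is small, hashing it offers no real anonymity, because an attacker can simply hash every possible plaintext and build a reverse lookup table.

```python
import hashlib
import string
from itertools import product

def build_lookup():
    """Hash every plausible medallion of a hypothetical digit-letter-digit-digit
    format (only 26,000 possibilities) and map hash back to plaintext."""
    lookup = {}
    for d1, letter, d2, d3 in product(string.digits, string.ascii_uppercase,
                                      string.digits, string.digits):
        plaintext = f"{d1}{letter}{d2}{d3}"
        lookup[hashlib.md5(plaintext.encode()).hexdigest()] = plaintext
    return lookup

lookup = build_lookup()

# An "anonymized" ID from a hypothetical released record:
hashed_id = hashlib.md5(b"5X55").hexdigest()
print(lookup[hashed_id])  # recovers the original medallion: 5X55
```

Building the full table takes a fraction of a second on a laptop, which is why hashing low-entropy identifiers is not anonymization.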
A Pattern of Responsibility
Taking a step back from these three examples, I see a useful pattern for thinking about what it means to practice data science with responsibility. You need to be responsible in your data collection, your data impacts, and your data use. I’ll explain each of those ideas.
Being responsible in your data collection means acknowledging the assumptions and biases baked into your data and your analysis. Too often these get thrown away while assessing the comparative performance of the various models a data scientist has trained. Some examples where this has failed? Joy’s InCoding example is one, of course, as is the classic Facebook “social contagion” study. A more troubling one is the poor methodology used by the US NSA’s SkyNet program.
Being responsible in your data impacts means thinking about how your work will operate in the social context of its publication and use. Will the models you trained come with a disclaimer identifying the populations you weren’t able to get data from? What are secondary impacts that you can mitigate against now, before they come back to bite you? The discriminatory behavior of the Google AdWords results I mentioned earlier is one example. Another is the dynamic pricing used by the Princeton Review disproportionately affecting Asian Americans. A third is the racially correlated trends revealed in where Amazon offers same-day delivery (particularly in Boston).
Being responsible in your data use means thinking about how others could capture and use your data for their purposes, perhaps out of line with your goals and comfort zone. The de-anonymization of NYC taxi records I mentioned already is one example of this. Another is the recent harvesting and release of OKCupid dating profiles by researchers who considered it “public” data.
Leadership and Guidelines
The problem here is that we have little leadership and few guidelines for how to address these issues in responsible ways. I have yet to find a handbook for the field that scaffolds how to think about these concerns. As I’ve said, the technology is far ahead of our collective reflection on it. However, that doesn’t mean there aren’t smart people thinking about this.
In 2014 the White House brought together a team to create their report on Big Data: Seizing Opportunities, Preserving Values. The title itself reveals their acknowledgement of the threat some of these approaches have for the public good. Their recommendations include a number of things:
- extending the consumer bill of rights
- passing stronger data breach legislation
- protecting student centered data
- identifying discrimination
- revising the Electronic Communications Privacy Act
Legislation isn’t strong in this area yet (at least here in the US), but be aware that it is coming down the pike. Your organization needs to be proactive here, not reactive.
Just two weeks ago, the Council on Big Data, Ethics and Society released their “Perspectives” report. This amazing group of individuals was brought together to create this report by a federal NSF grant. Their recommendations span policy, pedagogy, network building, and areas for future work. They include things like:
- new ethics review standards
- data-aware grant making
- case studies & curricula
- spaces to talk about this
- standards for data-sharing
These two reports are great reading to prime yourself on the latest high-level thinking coming out of more official US bodies.
So What Should We Do?
I’d synthesize all this into four recommendations for a business audience.
Define and maintain your organization’s values. Data science work shouldn’t operate in a vacuum. Your organizational goals, ethics, and values should apply to that work as well. Go back to your shared principles to decide what “responsible” data science means for you.
Do algorithmic QA (quality assurance). In software development, the QA team is separate from the developers, and can often translate between the languages of technical development and customer needs. This model can serve data science work well. Algorithmic QA can discover some of the pitfalls the creators of models might not.
Set up internal and external review boards. It can be incredibly useful to have a central place where decisions are made about what data science work is responsible and what isn’t for your organization. We discussed models for this at a recent Stanford event I was part of.
Innovate with others in your field to create norms. This stuff is very new, and we are all trying to figure it out together. Create spaces to meet and discuss your approaches to this with others in your industry. Innovate together to stay ahead of regulation and legislation.
These four recommendations capture the fundamentals of how I think businesses need to be responding to the push to do data science in responsible ways.
This post is cross-posted to the civic.mit.edu website.
The Digital Civil Society Lab at Stanford recently hosted a small gathering of people to dig into emerging processes for ethical data review. This post is a write up of the publicly shareable discussions there.
Lucy Bernholz opened the day by talking about “digital civil society” as an independent space for civil society in the digital sphere. She is specifically concerned with how we govern the digital sphere against a background of democratic theory. We need to use, manage, and govern digital data in ways that are expansive and supportive of an independent civil society. This requires new governance and review structures for digital data.
This prompted the question: what is “something like an IRB and not an IRB”? The folks in the room brought together corporate, community, and university examples. These encompass ethical codes and the processes for judging adherence to them. With this in mind, in the digital age, do non-profits need to change? What are the key structures and governance through which they can manage private resources for public good?
Lassana Magassa (Diverse Voices Project at UW)
Lassana introduced us all to the Diverse Voices Project, “an exploratory method for including diverse voices in policy development for emerging technologies.” His motivations lie in the fact that tech policy is generally driven by mainstream interests, and that policy makers are reactive.
They plan and convene “Diverse Voices Panels,” made up of people who live an experience, institutions that support them, and people somehow connected to them. A panel on disability, for example, could include people who are disabled, law and medical professionals, and family members. These panels produce whitepapers that document issues and then make recommendations. They’ve tackled everything from ethics and big data, to extreme poverty, to driverless cars. They focus on what technology’s impacts can be for diverse audiences. One challenge they face is finding and compensating panel experts. Another is figuring out how to prep a dense, technical document for the community to read.
Lassana talks about knowledge generation as the key driver: building awareness of diversity and of the impacts of technologies on various (typically overlooked) subpopulations.
Eric Gordon (Engagement Lab at Emerson College)
Eric (via Skype) walked us through the ongoing development of the Engagement Lab’s Community IRB project. The goal they started with was to figure out what a Community IRB is (public health examples exist). It turned out they ran into a bigger problem: transforming relationships between academia and the community in the context of digital data. There is more and more pressure to use data in more ways.
He tells us that in the Boston area, those who represent poorer folks in the city are asked for access to those populations all the time. They talked to over 20 organizations about the issues they face in these partnerships, focusing on investigating the need for a new model for these relationships. One key outcome: it turns out nobody knows what an IRB is, and the broader language used to talk about them is also problematic (“research”, “data”).
They ran into a few common issues worth highlighting. Firstly, there weren’t clear principles for assuring value for those who give up their data. In addition, the clarity of the research ask was often weak. There was an all-too-common lack of follow-through, and the semester-driven calendar is a huge point of conflict. An underlying point was that organizations have all this data, but the outside researcher is the expert who is empowered to analyze it. This creates anxiety in the community organizations.
They talked through IRBs, MOUs, and other models. It turns out people wanted someone to facilitate between organizations and researchers, so in the end what they need is not a document but a technique for maintaining relationships: something like a platform to match research and community needs.
Molly Jackman & Lauri Kanerva (Facebook)
Molly and Lauri work on policy and internal research management at Facebook. They shared a draft of the internal research review process used at Facebook, but asked that it not be shared publicly because it is still under revision. They covered how they do privacy trainings, research proposals, reviews, and approvals for internal and externally collaborative research.
Nicolas de Cordes (Orange Telecom)
Nicolas shared the process behind their Data for Development projects, like their Ivory Coast and Senegal cellphone data challenges. The process was highly collaborative with the local telecommunications ministries of each country. Those conversations produced approvals, and key themes and questions to work on within the country. This required a lot of education of various ministries about what could be done with the cellphone call metadata information.
For the second challenge, Orange set up internal and external review panels to handle the submissions. The internal review panels included Orange managers not related to the project. The external review panel tried to be a balanced set of people. They built a shared set of criteria by reviewing submissions from the first project in the Ivory Coast.
Nicolas talks about these two projects as one-offs, with scaling being a large problem. In addition, getting the review panels to come to a shared agreement on ethics was (not surprisingly) difficult.
After some lunch and collaborative brainstorming about the inspirations in the short talks, we broke out into smaller groups to have more free form discussions about topics we were excited about. These included:
- an international ethical data review service
- the idea of minimum viable data
- how to build capacity in small NGOs to do this
- a people’s review board
- how bioethics debates can be a resource
I facilitated the conversation about building small NGO capacity.
Building Small NGO Capacity for Ethical Data Review
Six of us were particularly interested in how to help small NGOs learn to ask these ethics questions about data. Resources exist out there, but they aren’t written accessibly enough for this audience to consume. The privacy field especially has a lot of practice, but only some of its approaches are transferable. The language around privacy is all too hard for “regular people” to understand. However, its approach of “data minimization” might have some utility.
We talked about how to help people avoid extractive data collection, and the fact that it is disempowering. The non-profit folks in the group reminded us all that you have to think about the funder’s role in the evidence they are asking for, and how they help frame questions.
Someone mentioned that law can be the easiest part of this, because it is so well-defined (for good or bad). We have well-established laws on the fundamental privacy rights of individuals in many countries. I proposed participatory activities to learn these things, like perhaps a group activity to try to re-identify “anonymized” data collected from the group. Another participant mentioned DJ Patil’s approach to building a data culture.
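To sketch how that re-identification activity could work, here’s a minimal illustration with entirely invented data: records stripped of names can often be linked back to individuals by joining quasi-identifiers (zip code, birth year, gender) against some public roster, like a sign-in sheet.

```python
# All data here is invented for illustration; in the workshop version the
# "roster" would be information the group shared about themselves.

anonymized_health_records = [
    {"zip": "02139", "birth_year": 1985, "gender": "F", "diagnosis": "asthma"},
    {"zip": "02139", "birth_year": 1990, "gender": "M", "diagnosis": "flu"},
]

public_roster = [  # e.g. a voter roll or workshop sign-in sheet
    {"name": "Alice", "zip": "02139", "birth_year": 1985, "gender": "F"},
    {"name": "Bob",   "zip": "02139", "birth_year": 1990, "gender": "M"},
]

def reidentify(records, roster):
    """Link each 'anonymous' record to every roster entry that matches
    on all three quasi-identifiers."""
    matches = []
    for rec in records:
        for person in roster:
            if all(rec[k] == person[k] for k in ("zip", "birth_year", "gender")):
                matches.append((person["name"], rec["diagnosis"]))
    return matches

print(reidentify(anonymized_health_records, public_roster))
# [('Alice', 'asthma'), ('Bob', 'flu')]
```

Seeing their own “anonymous” row re-identified makes the abstract privacy risk concrete for a non-technical group.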
Our key points to share back with the larger group were that:
- privacy has inspirations, but it’s not enough
- communications formats are critical (language, etc); hands-on, concrete, actionable stuff is best
- you have to build this stuff into the culture of the org
Recently, at the Responsible Visualization event put on by the Responsible Data Forum, I had a wonderful chance to sit down with the amazing Patrick Ball from the Human Rights Data Analysis Group and talk through how we help groups learn about working with incomplete data.
With my focus on capacity building, I’m trying to find fun ways for NGOs to learn about accuracy and data at a very basic level. Patrick argues, from his background in human rights data, that you in fact need rigorous statistical analysis to do this well. I pushed a bit, asking him if there was an 80/20 shortcut. His response was to draw a great distinction between homogeneous and heterogeneous observability of data. For instance, there are many questions that don’t require quantitative rigor – case existence, case history, etc. This sparked a fun conversation about visual techniques for conveying uncertainty.
Watch the video to see the short conversation, or just catch the audio below.
The semester has started again at MIT, which means I’m teaching a new iteration of my Data Storytelling Studio course. One of our first sessions focuses on learning to ask questions of your data… and this year that was a great chance to use the new WTFcsv tool I created with Catherine D’Ignazio.
The vast majority of the students decided to work with our fun UFO sample data. They came up with some amazing questions to ask, with a lot of ideas about connecting it to other datasets. A few focused in on potential correlations with sci-fi shows on TV (perhaps inspired by the recent reboot of The X-Files).
One topic I reflected on with students at the close of the activity was that the majority of their questions, and the language they used to describe them, came from a point of view that doubted the legitimacy of the UFO sightings. They wanted to “explain” the “real” reason for what people saw. They assumed the witnesses merely imagined that what they saw was aliens, which of course couldn’t be true.
Now, with UFO sightings this isn’t especially offensive. However, with datasets about more serious topics, it’s important to remember that we should approach them from an empathetic point of view. If we want to understand data reported by people, we need to have empathy for where the data reporter is coming from, despite any biases or pre-existing notions we might have about the legitimacy of what they say happened.
This isn’t to say that we shouldn’t be skeptical of data; by all means we should be! However, if we only wear our skeptical hat we miss a whole variety of possible questions we could be asking our dataset.
So, when it comes to UFO sightings, be sure to wonder “What would Mulder do?”
Just yesterday I was in a room of amazing friends, new and old, talking about what responsible data visualization might be. Organized by the Engine Room as part of their series of Responsible Data Forums (RDF), this #RDFViz event brought together 30 data scientists, community activists, designers, artists, and visualization experts to tease apart a plan of action for creating norms for a responsible practice of data visualization.
Here’s a write-up of how we tackled that in the small group I led, focused on what responsibility means when building visual literacy.
Building Literacy for Responsible Visualization
I’ve written a bunch about data literacy and the variety of ways I try to build it with community groups, but we received strict instructions to focus this conversation on visualization. That was hard! So we started off by making sure we understood the audiences we were talking about – people who make visualizations and people who see/read them. So many ways to think about this… so many questions we could address… we were lost for a bit about where to even start!
We decided to pick four guiding questions to propose to ourselves and all of you, and then answered them by sketching out quick suggestions for things that might help.
- How can visual literacy for data be measured?
- How can existing resources for data visualization reach the growing population of non-technical data visualization producers?
- How can we teach readers to look at data visualization more critically?
- How can we help data visualization producers to design more appropriately for their audiences?
A difficult set of questions, but our group of four dove into them unafraid! Here’s a quick run-down on each. For the record, I only worked on two of these, so I hope I do justice to the other two I didn’t directly dig into.
Measuring Visual Literacy
This is a tricky task, fraught with cultural assumptions. We began by narrowing it down to the dominant visual form for representing data – namely, classic charts and graphs. This simplified the question a little, but of course buys into the power dynamics and all the baggage that comes along with that form.
Our idea was to create an interactive survey/game that asks people to read and reason about visualizations. Of course this draws on a lot of existing research into visual and data literacy, but that body of work doesn’t offer an agreed-upon set of questions to assess them. So we came up with the following topics, with example questions, as a starting point.
- Can you read it? This topic addresses basic visual comprehension of classic charting. The example question would show something like a bar chart and ask “What is the highest value?”.
- What would you do? This topic digs into making reasoned judgements about personal decisions based on information shown in a visual form. The example question is a line chart showing vaccination rates going down over time while measles cases go up, asking “Would you vaccinate your children?”.
- What can you tell? Another topic to address is making judgements about whether data shows a pattern or not. The example question would show a statement like “Police kill women more than men – true or false?”, and the answers could be “true”, “false”, and “can’t tell”.
- What’s the message? More complex combinations of charts and graphs are often trying to deliver a message to the reader. Here we could show a small infographic documenting corruption somewhere, then ask “What is the message of this graphic?” with possible answers of “corruption is rampant”, “corruption happens”, and “public funds are too high”.
These are just four topics, and we know there are more. We’re excited about this survey, and hope to find the time and funds to review existing surveys that assess various types of literacy so we can build a good tool to help people measure them in various communities!
Choosing the Right Visualization for Your Audience
We have a vast and growing array of visualization techniques available to us, but few guidelines on how to use them appropriately for different audiences. This is problematic; a responsible version of data visualization should respect where an audience is coming from and their visual literacy. With that in mind, we propose creating a library of case studies, where each one creates different visualizations from the same dataset, making the same argument, for different audiences.
For example, we sketched out ways to argue that police violence is endemic in the US, based on a theoretical dataset that captures all police-related killings. For a low visual literacy individual (maybe a 10-year-old kid) you could start by showing the face of one victim, and then zoom out to a grid of all the victims to show the scale of the problem while still humanizing it. For a medium literacy audience (those who watch the evening news each night on TV), you could show a line chart of killings by year. For a high literacy audience (readers of the New York Times) you could do an interactive map that shows killings near the reader’s location and how they compare to nationwide trends.
You could imagine a library of many of these, which we think would help people reflect on what is appropriate for various audiences. I’m excited to assign this to students in my Data Storytelling Studio course!
Learning to Read A Data Visualization
The problem here is that people aren’t being introduced to critical techniques for reading visualizations, so they can’t identify when one is being irresponsible. Our idea was to create a quick how-to guide listing things you should ask when reading a data visualization. Imagine a listicle called “15 Things to Check in Any Data Visualization”!
Some things that might be on this list include:
- Is the data source identified?
- Are the axes labelled correctly?
- What is the level of aggregation?
This list could expose some of the common techniques for creating misleading visualizations. Next steps? We’d like to crowdsource the completion of the list to make sure we don’t miss any important ideas.
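To make the axis question on the checklist concrete, here’s a minimal, dependency-free sketch (with invented numbers) of one classic misleading technique: the same two values look nearly identical against a zero baseline and wildly different against a truncated one.

```python
# Two invented values that differ by only 2%.
values = {"Product A": 98, "Product B": 100}

def bar(value, baseline, scale):
    """Draw a text bar for `value`, measured from `baseline`."""
    return "#" * round((value - baseline) * scale)

print("Axis starting at 0 (honest):")
for name, v in values.items():
    print(f"  {name}: {bar(v, baseline=0, scale=0.1)}")

print("Axis starting at 95 (misleading):")
for name, v in values.items():
    print(f"  {name}: {bar(v, baseline=95, scale=2)}")
```

With a zero baseline the two bars are effectively the same length; with the truncated baseline, Product B’s bar looks nearly twice as long as Product A’s, even though the underlying data hasn’t changed.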
Helping Non-Experts Learn to Make Data Visualizations
This is a huge problem. The hype around data visualization continues to grow, and more and more tools are being created to help non-experts make them. Unfortunately, the materials we use to welcome these newcomers into the field haven’t kept pace with the huge rise in interest!
We proposed addressing this by better defining what these new audiences need to know. They include:
- human rights organizations
- community groups
- social movements
And more! A brief brainstorm resulted in this list of things they are trying to learn:
- how to select the right data to visualize?
- what types of charts are best suited to understand what types of data?
- what cultural assumptions are reflected in what types of dataviz?
- how do design decisions (eg. color) impact on how readers will understand your data visualization?
This is just a preliminary list of course.
Rounding it Up
Just kidding… we have a lot of work to do if we want to build a responsible approach to literacies around data visualization. These four suggestions from our small working group at the #RDFViz event are just that – suggestions. However, the space to approach this from a responsible point of view, and the conversations and disagreements it sparked, were invaluable!
Many thanks to the organizers and funders, including our facilitator Mushon Zer-Aviv, our organizers at the Engine Room, our hosts at ThoughtWorks, Data & Society and Data-Pop Alliance, and our sponsors at Open Society Foundations and Tableau Foundation. This is cross-posted to the MIT Center for Civic Media website.