UN Data Forum – Data Literacy: What, Why and How? (liveblog)

This is a liveblog written by Rahul Bhargava at the 2017 UN World Data Forum.  This serves as a summary of what the speakers spoke about, not an exact recording.  With that in mind, any errors or omissions are likely my fault, not the speakers’.

This panel has four speakers on the topic of data literacy, with an emphasis on front-line, practical things.

Empowering Future Users through Data Literacy – Professor Delia North

Dean and Head of Math, Statistics and Computer Science at the University of KwaZulu-Natal, Durban.  She wants to spread the message of empowering people (a theme for this session).  Prof North, who has been teaching for over 30 years, works on curriculum design for school-level teacher training.  She has a passion for statistics and youth, at the national level in addition to within her university.

The need to maintain a competitive economy drives the need for statistical literacy from basic operations, to the PhD level.  All citizens need basic statistical literacy, for basic citizenship; best to accomplish this while they are in school. Professionals need competence to use statistics effectively in the workplace. Specialists need to continually improve their practice.  University tends to think everyone is on the path to becoming a mathematical statistician, but this is an old-fashioned approach.  This isn’t developing them as “consumers” of statistics.

Statistics is often introduced as “hidden” inside of mathematics, so this is what people in South Africa think about.  That doesn’t identify it as a job opportunity to learners. In addition, statisticians are poor at marketing their discipline. It is viewed as difficult, boring and confusing.  There is a shortage of skills, and an overestimation of ability.  The best statisticians go to industry, so universities are left understaffed.  There are “too few enablers” of statistical literacy.

Data used to be scarce, but now it is everywhere.  This requires a rethink of the way we introduce statistics. This involves bringing in more data, and teaching with new methods.  Students need to be actively involved with working with large datasets.  This is an opportunity, not a threat. The questions we ask on our assessments are calculator-driven, not focused on analytical thinking.

Data literacy is an essential part of statistical literacy.  Decisions based on data should be part of statistical literacy training. Statistics should be taught as applied mathematics, used within another discipline.  For example, they collected rubbish with children and had them track the amount and graph it. You can’t keep it trapped in mathematics classes.  You have to make learning these concepts fun!  Engaging workshops can radically change how empowered a group of teachers feels to introduce statistics.  They want to learn new teaching methods.  You have to teach them at the beginning to introduce things in the right way.

Empowering Users in Situ – Dr. Sati Naidu 

Executive Manager for Stakeholder Relations for Statistics South Africa.  Stats SA has moved away from selling the data to helping people use the data for making evidence-based decisions. In 1996 South Africa did its first census. The first CD they produced cost 100,000 USD.  Now data collection is scattered across all the departments.  That should all be available on one platform to drive decision making.  They set up CRUISE, to merge a course for statistics, GIS, planning, and economics all together.  Dr. Naidu attended this course and learned much about a geographic approach to statistics.  Mapping can reveal patterns that are otherwise hidden by traditional analytical means.  This is demonstrated with a powerful set of maps that show the incidence of HIV/AIDS over time across Africa.

Now Stats SA uses GIS to create a platform that combines geometry, shapefiles, and more. This lets them create thematic maps very easily. They offer trainings on these tools throughout South Africa.

Another example is looking at piped water over time, to see an increase.  With the map you can see which areas improved, and look for patterns in those with low or high services.  You can run hotspot analysis to look at unemployment data. You can do geospatial analysis to look for outliers and then look for causes.

When data is non-stationary you can’t just use traditional statistical analysis. For instance new houses are much more expensive than old houses in most of Cape Town. But in one area, new houses are very cheap because of the location.  So in one part of town there is a positive correlation, and in another there is a negative one.  You can find this with geographically weighted regression (GWR), while it would be hidden in a traditional regression.
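To make the housing example concrete, here is a minimal sketch of the idea behind GWR, not the analysis Stats SA ran. The house data is made up, the two neighbourhoods and the `bandwidth` value are assumptions for illustration, and a real GWR fits a full regression at every location instead of the single-predictor slope computed here.

```python
import math

def weighted_slope(xs, ys, ws):
    """Slope b of the weighted least-squares line y = a + b*x."""
    total = sum(ws)
    mean_x = sum(w * x for w, x in zip(ws, xs)) / total
    mean_y = sum(w * y for w, y in zip(ws, ys)) / total
    cov = sum(w * (x - mean_x) * (y - mean_y) for w, x, y in zip(ws, xs, ys))
    var = sum(w * (x - mean_x) ** 2 for w, x in zip(ws, xs))
    return cov / var

def local_slope(houses, focus, bandwidth):
    """Slope of price on age near `focus`; closer houses get larger Gaussian weights."""
    ws = [math.exp(-0.5 * (math.dist(h["loc"], focus) / bandwidth) ** 2)
          for h in houses]
    return weighted_slope([h["age"] for h in houses],
                          [h["price"] for h in houses], ws)

# Hypothetical data: two neighbourhoods 10 km apart, prices in thousands.
houses = []
for age in range(0, 50, 5):
    # West: newer houses cost more (price falls as age rises).
    houses.append({"loc": (0.0, 0.0), "age": age, "price": 500 - 4 * age})
    # East: newer houses are cheaper (price rises as age rises).
    houses.append({"loc": (10.0, 0.0), "age": age, "price": 100 + 4 * age})

# A single global regression weights every house equally.
global_slope = weighted_slope([h["age"] for h in houses],
                              [h["price"] for h in houses],
                              [1.0] * len(houses))

# Local regressions centred on each neighbourhood.
west_slope = local_slope(houses, (0.0, 0.0), bandwidth=2.0)
east_slope = local_slope(houses, (10.0, 0.0), bandwidth=2.0)

print(f"global: {global_slope:.2f}, west: {west_slope:.2f}, east: {east_slope:.2f}")
```

In this toy data the two opposite relationships cancel out in the global fit, while the local fits recover a strongly negative slope in one area and a strongly positive one in the other.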

Stats SA has all the official data.  Now they want to engage with private providers to make their data available.  We need to change from Big Data to Open Data, to go from its size to how it is used.

Data Literacy for Capacity Building – Dr. Blandina Kilama

Dr. Kilama works for REPOA on Poverty Research in Tanzania. REPOA is a think tank in Tanzania that undertakes policy research.  She also teaches statistics part-time, and will share some of her learnings from there.

The stakeholders vary from Policy Makers, to Academia, to Media, to CSOs. In Tanzania, agriculture is a major employer. This matters because politicians and others often conflate things like employment and productivity when talking about growth. Most African countries are seeing growth from productivity, not from labor.  For instance, agriculture, industry and services contribute roughly equally in terms of the economy.  However, more than 70% of the labour force works in agriculture.

This capacity causes problems sometimes.  For instance REPOA produced some poverty maps that were used by policy makers, leading to reactions of surprise and accusations.  Spatial analysis helped them explain this better, by showing how districts next to cities experienced growth, while districts next to refugee camps showed a lack of growth.

For media, REPOA builds in flexibility. They do half-day trainings, and make topics relevant for their current work.  These fit the media workers’ schedules, between their morning check-ins and afternoon deadlines.

The challenges include weak numerical literacy, a shift in policies, and a lack of time. In Tanzania there is a common saying “we are all scared of numbers.”  This attitude is a real social challenge to conquer; the stakeholders have a deep fear of numbers. Policies need to shift to include the idea that people providing the data are protected, and experience benefits from it.

Data and Statistics: the sciences, the literacies and collaboration – Professor Helen MacGillivray

Dr. MacGillivray is a high-level mathematical statistician, and heavily involved with teacher training. She works in Australia, and is the incoming President of the International Statistical Institute.  This is a big topic, and the challenges reflect that.

In Australia, the people involved in teaching are the ones thinking about what is data literacy, and what is data science. There are valuable lessons in the decades of work on building statistical literacy.  These include work within the other disciplines.  Some tidbits include the idea that descriptions are better than definitions, and that discussion is essential, but diagrammatic representations are not.

Statistical literacy focuses on understanding and consuming information, and on interpreting and thinking critically about it. This differs at grade levels. The curriculum has an aim of helping you look behind the data, ask why it is presented, and what questions can be asked.

With data literacy there aren’t many definitions around. The ones that exist vary. Some split this between information literacy and data management.

Why is this important?  It is for everyone to the extent appropriate for their level of education, training, and work. This is very contextual, so it is a constant learning.

How do you do this?  Models at the governmental level are actually decades old.  The emphasis is on the problems, the plan, getting the data, analyzing, and then discussions and interpretation.  Dr. MacGillivray, in her workshops with teachers, encourages them to not think about the problem and the answer.  This work is much wider than that.   At the professional level, current approaches lead statisticians to think that they should NOT be involved with the collection of data; that somehow that gets their hands dirty.  They think it is a waste of a statistician’s valuable time.  Nothing could be further from the truth.

In terms of penetration, there is lots of practice, but current teaching methods are still buried in old practices. They need to use complex, many-variabled datasets.  This leads to impediments for data literacy and data science.  Instead of a misplaced focus on calculation as in statistical literacy education, in data science education there is a misplaced focus on coding.


Q & A

How about grassroots data literacy – what school do I send my students to?  Can students analyze air quality?  Part of data literacy is knowing data is important for decision making.

Prof North responds that the sourcing of data is critical: what it is, where it came from, and why it was collected. Now we try to use household data that is from the world of the student.  You can use larger datasets, but still from the world of the student.

In terms of data availability, is there a way to assess the data literacy levels of different countries? How can we do better outreach?

Dr. Naidu responds that, in terms of dissemination, Stats SA now takes the data to the people.  They run huge publicity campaigns to argue for collection, and then take the results back to the people.

The SDGs combine social, economic, and environmental measurements. The average person on the street, who is the target for behavior change, needs to understand the links between the three.  Where does scientific literacy come into this?

Prof MacGillivray reminds us that this is an old question, because these literacies operate within context in other fields.  We have to work with other disciplines and their educations.  Prof North adds that at her university they implemented practices that try to involve the other disciplines.  So if a student came in for help from another department, they involved the supervisor.  Dr. Kilama adds that in her country collecting environmental data is the challenge they face.

Using data literacy as a means to protect people from fake statistics.  Visualization can make bad statistics very acceptable.  We need to educate people about how to differentiate between good data and good-looking data.

This is the focus of the critical approaches.

Regarding adaptability for developing countries, places where connectivity is quite low?  Can we use radio for this?

This is our perspective from the Netherlands, so we don’t have good approaches already. Perhaps other people in the room do.

Two New Academic Papers

If you’ve been to my hands-on workshops, you might be surprised to hear I’m also the “academic paper” kind of guy.  In fact, my position here as Research Scientist at the MIT Media Lab means that one of the ways I contribute is by publishing academic papers.  I have two of those in the latest issue of the International Journal of Community Informatics, a special edition on Data Literacy.  Give them a read if you want a deeper look into either how our Data Murals work, or into the design and use of our DataBasic.io suite of activities and tools.


Data Murals: Using the Arts to Build Data Literacy

Rahul Bhargava, Ricardo Kadouaki, Emily Bhargava, Guilherme Castro, Catherine D’Ignazio

Current efforts to build data literacy focus on technology-centered approaches, overlooking creative non-digital opportunities. This case study is an example of how to implement a Popular Education-inspired approach to building participatory and impactful data literacy using a set of visual arts activities with students at an alternative school in Belo Horizonte, Brazil.  As a result of the project data literacy among participants increased, and the project initiated a sustained interest within the school community in using data to tell stories and create social change.

DataBasic: Design Principles, Tools and Activities for Data Literacy Learners

Catherine D’Ignazio, Rahul Bhargava

The growing number of tools for data novices are not designed with the goal of learning in mind. This paper proposes a set of pedagogical design principles for tool development to support data literacy learners.  We document their use in the creation of three digital tools and activities that help learners build data literacy, showing design decisions driven by our pedagogy. Sketches students created during the activities reflect their adeptness with key data literacy skills. Based on early results, we suggest that tool designers and educators should orient their work from the outset around strong pedagogical principles.


Data Haves and Data Have-Nots

This week I’m at the Data Literacy Conference in France. One of the reasons I’m super excited about this is because it is a gathering of people I’ve been wanting to talk to for years! Although there are tons of conferences about data, there are few conferences focused on the literacy aspect, so I thank Fing for putting this together.  Catherine D’Ignazio and I both presented a talk and workshop.  You can see our slides for our talk about Bridging the Gap Between Data Haves and Data Have-Nots.  It focused on describing how to help two audiences:

  1. We want to help those in power, the “Data-Haves”, learn how to present their data in more appropriate ways.
  2. We want to help those that don’t usually have power, the “Data Have-Nots”, build their capacity to use data to create change in the world around them.

Too often we focus on just the second goal, ignoring the needs of those that have the data.


We also ran a workshop for about 20 attendees, focused on how our DataBasic activities can help build data literacy in a variety of ways.

Overall the conference was a wonderful gathering of like-minded individuals.  Catherine and I live-blogged the plenary talks:

Talking Data with Museum Visitors

Last weekend I had the pleasure of running a data sculpture workshop for the public at the MIT Museum’s Idea Hub. They offer hands-on activities for museum visitors every Sunday, and after chatting we decided to try adding my activity to the lineup. With an amazing set of craft materials, and some one-page data prompts about MIT, we invited visitors to drop in and find data-driven stories they could tell by building simple sculptures.  The sheets included information about the amount of sleep students get, the cost of undergraduate education in the US, and happiness in Somerville.

It was so fun to be able to have this conversation with a random set of curious folks. As we built things we chatted about loads of topics related to data literacy. Some people dug into how you could find simple or complex stories in such small datasets. Others explored how to present the impact of the data, not the data itself. Some decided to use totally different data, related to their lives. This variety created a great set of evocative examples that made discussions later in the afternoon even richer.

I used to do a lot more museum work, so it was a pleasure to be back in that setting. Museums prime people’s brains to be curious, so it’s wonderful to offer an invitation in that space to discuss and explore a topic more deeply. Actually when I was a student here at MIT I volunteered at the museum, helping run robotics workshops for kids and adults with my good friend Stephanie Hunt. It felt great to be back!

I look forward to dropping in when the museum staff runs this on their own. Can’t wait to see how they make it even better.

Here is a list of some of the data sculptures people made:

Tools for Teachers

My background is in education, so I’m always excited when I get to run a workshop for teachers.  Earlier this morning I had a chance to lead a workshop and conversation with 50 teachers from the Nord Anglia network of private schools, who have partnered with MIT Museum and the Cambridge Science Festival to think harder about STEAM education at various age levels.


I introduced a number of the activities I run, and the DataBasic.io suite. After each, I took a step back and asked participants to reflect on them as educators.  This created some wonderful conversations about everything from building critical data thinking to the inspirations I draw from formal arts education. I look forward to chances to work with these teachers more!

Here’s a link to the slides I used.


Using Data for More than Operations

While at Stanford to talk about “ethical data” I had a chance to read through the latest issue of the Stanford Social Innovation Review within the walls where it is published.  One particular article, Using Data for Action and Impact by Jim Fruchterman, caught my eye.  Jim lays out an argument for using data to streamline operational efficiencies and monitoring and evaluation within non-profit organizations.  This hit one of my pet peeves, so I’m motivated to write a short response arguing for a more expansive approach to thinking about non-profits’ use of data.

This idea that data is confined to operational efficiency creates a missed opportunity for organizations working in the social good sector. When giving talks and running workshops with non-profits I often argue for three potential uses of data – improving operations, spreading the message, and bringing people together. Jim, whose work at Benetech I respect greatly, misses an opportunity here to broaden the business case to include the latter two.

Data presents non-profits with an opportunity to engage the people they serve in an empowering and capacity-building way, reinforcing their efforts towards improving conditions on whatever issue they work on. Jim’s “data supply chain” presents the data as a product of the organization’s work, to be passed up the funding ladder for consumption at each level. This extractive model needs to be rethought (as Catherine D’Ignazio and I have argued).  The data collected by non-profits can be used to bring the audiences they serve together to collaboratively improve their programs and outcomes.  Think, for example, about the potential impacts for the Riders for Health organization he discusses if they brought drivers together to analyze the data about their routes and distances.  I wonder about the potential impacts of empowering the drivers to analyze the data themselves and take ownership of the conclusions.

Skeptical that you could bring people with low data literacy together to analyze data and find a story in it?  That is precisely a problem I’ve been working on with my Data Mural work. We have a process, scaffolded by many hands-on activities, that leads a collaborative group through analyzing some data to find a story they want to tell, designing a visual to tell that data-driven story, and painting it as a mural.  We’ve worked with people around the world to do this.  Picking it apart leaves us with a growing toolkit of activities being used by people around the world.

Still skeptical that you can bring people together around data in rural, uneducated settings? My colleague Anushka Shah recently shared with me the amazing work of Praxis India. They’ve brought people together in various settings to analyze data in sophisticated ways that make sense because they rely on physical mappings to represent the data.

Charting crop production and rainfall trends over time.
Yes, that looks like a radar chart to me too.

These examples illustrate that the social good non-profits can deliver with data is not constrained to operational efficiencies.  We need to highlight these types of examples to move away from a story about data and monitoring, to one about data and empowerment.  In particular, thought leaders like SSIR and Jim Fruchterman should push for a broader set of examples of how data can be used in line with the social good mission of non-profits around the world.

Cross-posted to the civic.mit.edu blog.

Visualizing with Food

I’m fascinated by food and data.  I’ve been doing food security data murals, my Data Storytelling Studio class in 2015 focused on food security data, and I’ve been laser-cutting data onto veggies for public events.

One of my laser-cut veggies showing local food security data.

So not surprisingly I was excited to see Data Cuisine coming to Boston for a workshop by Susanne Jaschko and Moritz Stefaner! Sadly I’m out of town and can’t make the workshop, but it sparked me thinking about food and data, and creative data representation a bit.

When doing data presentation in a creative medium, you have to choose your mappings and datasets carefully.  I’m often introducing people to more creative techniques for data presentation for the first time, and argue the strongest stories come when the message matches the medium well.  For example, one of their participant projects maps tomato and basil in a dish to the number of Italian speakers.  This is a fairly culturally loaded mapping that many would understand.  However, others are more abstract.  One mapped people to noodles to discuss sexual habits. A stronger mapping is the project that makes a joke about “death by chocolate” by creating small caskets to tell the story of common causes of death in Belgium.

Examples from a recent data-cuisine workshop

Another intriguing example is Dan Barber’s red pepper egg (featured in an episode of Netflix’s Chef’s Table show).

Dan Barber’s red pepper egg

He worked with a farmer to breed super colorful red peppers, then fed a mash of them to chickens to create the red yolk you see above!  Why?  All to start a person about to eat the egg wondering how it got that red.  What did the chicken eat to make that happen?  Why have I never thought about the supply chain going into this egg before?

To me, Barber’s red pepper egg is a wonderful example of data representation as food.  The food chain data is beautifully captured in the red yolk, and it prompts you to ask questions directly aligned with his goals in presenting it.  Wonderful!

More abstract representations of data in food feel like a missed opportunity to me.  The artistic merit can be there, but they leave the viewer hungry for more.  A strong mapping between the medium of your data presentation, and the data and story itself, is key to creating a lasting impression.



Practicing Data Science Responsibly

I recently gave a short talk at a Data Science event put on by Deloitte here in Boston.  Here’s a short write up of my talk.

Data science and big data driven decisions are already baked into business culture across many fields.  The technology and applications are far ahead of our reflections about intent, appropriateness, and responsibility.  I want to focus on that word here, which I steal from my friends in the humanitarian field.  What are our responsibilities when it comes to practicing data science?  Here are a few examples of why this matters, and my recommendations for what to do about it.


People Think Algorithms are Neutral

I’d be surprised if you hadn’t heard about the flare-up about Facebook’s trending news feed recently.  After breaking on Gizmodo it has been covered widely.  I don’t want to debate the question of whether this is a “responsible” example or not.  I do want to focus on what it reveals about the public’s perception of data science and technology.  People got upset, because they assumed it was produced by a neutral algorithm, and this person that spoke with Gizmodo said it was biased (against conservative news outlets).  The general public thinks algorithms are neutral, and this is a big problem.


Algorithms are artifacts of the cultural and social contexts of their creators and the world in which they operate.  Using geographic data about population in the Boston area?  Good luck separating that from the long history of redlining that created a racially segregated distribution of ownership.  To be responsible we have to acknowledge and own that fact.  Algorithms and data are not neutral third parties that operate outside of our world’s built-in assumptions and history.

Some Troubling Examples

Let’s flesh this out a bit more with some examples.  First I look to Joy Buolamwini, a student colleague of mine in the Civic Media group at the MIT Media Lab.   Joy is starting to write about “InCoding” – documenting the history of biases baked into the technologies around us, and proposing interventions to remedy them. One example is facial recognition software, which has consistently been trained on white male faces; to the point where she has to literally don a white-face mask to have the software recognize her.  This is just the tip of the iceberg in computer science, which has a long history of leaving out entire potential populations of users.


Another example is a classic one from Latanya Sweeney at Harvard.  In 2013 she discovered a racial bias trained into the operation of Google’s AdWords platform.  When she searched for names that are more commonly given to African Americans (like her own), the system popped up ads asking if the user wanted to do background checks or look for criminal records.  This is an example of the algorithm reflecting built-in biases of the population using it, who believed that these names were more likely to be associated with criminal activity.

My third example comes from an open data release by the New York City taxi authority.  They anonymized and then released a huge set of data about cab rides in the city.  Some enterprising researchers realized that they had done a poor job of anonymizing the taxi medallion ids, and were able to de-anonymize the dataset.  From there, Anthony Tockar was able to find strikingly juicy personal details about riders and their destinations.
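One common way this kind of anonymization goes wrong is replacing each ID with an unsalted hash. The sketch below shows why that is weak when the ID space is small. The ID format (digit, letter, digit, digit) and the use of SHA-256 are my assumptions for illustration, not a description of the actual release; the point holds for any unsalted hash function.

```python
import hashlib
import string
from itertools import product

def hash_id(raw_id):
    """'Anonymize' an ID with an unsalted hash (the flawed approach)."""
    return hashlib.sha256(raw_id.encode("ascii")).hexdigest()

def all_possible_ids():
    """Every ID in a hypothetical format: digit, letter, digit, digit (e.g. 5X55)."""
    for d1, letter, d2, d3 in product(string.digits, string.ascii_uppercase,
                                      string.digits, string.digits):
        yield f"{d1}{letter}{d2}{d3}"

# The whole ID space is only 10 * 26 * 10 * 10 = 26,000 values, so we can
# hash every one of them and build a reverse lookup table in well under a second.
lookup = {hash_id(raw_id): raw_id for raw_id in all_possible_ids()}

# What a published record would contain, and what an attacker recovers from it.
anonymized = hash_id("5X55")
recovered = lookup[anonymized]

print(len(lookup), recovered)
```

The hash looks random, but because anyone can enumerate the inputs it offers no real protection. Replacing each ID with an unrelated random token, kept in a private mapping, avoids this particular failure.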

A Pattern of Responsibility

Taking a step back from these three examples I see a useful pattern for thinking about what it means to practice data science with responsibility.  You need to be responsible in your data creation, data impacts, and data use.  I’ll explain each of those ideas.


Being responsible in your data collection means acknowledging the assumptions and biases baked into your data and your analysis.  Too often these get thrown away while assessing the comparative performance between various models trained by a data scientist.  Some examples where this has failed?  Joy’s InCoding example is one of course, as is the classic Facebook “social contagion” study. A more troubling one is the poor methodology used by US NSA’s SkyNet program.

Being responsible in your data impacts means thinking about how your work will operate in the social context of its publication and use.  Will the models you trained come with a disclaimer identifying the populations you weren’t able to get data from?  What are secondary impacts that you can mitigate against now, before they come back to bite you?  The discriminatory behavior of the Google AdWords results I mentioned earlier is one example. Another is the dynamic pricing used by the Princeton Review disproportionately affecting Asian Americans.  A third is the racially correlated trends revealed in where Amazon offers same-day delivery (particularly in Boston).

Being responsible in your data use means thinking about how others could capture and use your data for their purposes, perhaps out of line with your goals and comfort zone.  The de-anonymization of NYC taxi records I mentioned already is one example of this.  Another is the recent harvesting and release of OKCupid dating profiles by researchers who considered it “public” data.

Leadership and Guidelines

The problem here is that we have little leadership and few guidelines for how to address these issues in responsible ways.  I have yet to find a handbook for a field that scaffolds how to think about these concerns. As I’ve said, the technology is far ahead of our reflections on it together.  However, that doesn’t mean that there aren’t smart people thinking about this.


In 2014 the White House brought together a team to create their report on Big Data: Seizing Opportunities, Preserving Values.  The title itself reveals their acknowledgement of the threat some of these approaches have for the public good.  Their recommendations include a number of things:

  • extending the consumer bill of rights
  • passing stronger data breach legislation
  • protecting student centered data
  • identifying discrimination
  • revising the Electronic Communications Privacy Act

Legislation isn’t strong in this area yet (at least here in the US), but be aware that it is coming down the pipe.  Your organization needs to be pro-active here, not reactive.

Just two weeks ago, the Council on Big Data, Ethics and Society released their “Perspectives” report.  This amazing group of individuals was brought together to create this report by a federal NSF grant.  Their recommendations span policy, pedagogy, network building, and areas for future work.  They include things like:

  • new ethics review standards
  • data-aware grant making
  • case studies & curricula
  • spaces to talk about this
  • standards for data-sharing

These two reports are great reading to prime yourself on the latest high-level thinking coming out of more official US bodies.

So What Should We Do?

I’d synthesize all this into four recommendations for a business audience.


Define and maintain your organization’s values.  Data science work shouldn’t operate in a vacuum.  Your organizational goals, ethics, and values should apply to that work as well. Go back to your shared principles to decide what “responsible” data science means for you.

Do algorithmic QA (quality assurance).  In software development, the QA team is separate from the developers, and can often translate between the languages of technical development and customer needs.  This model can serve data science work well.  Algorithmic QA can discover some of the pitfalls the creators of models might not.

Set up internal and external review boards. It can be incredibly useful to have a central place where decisions are made about what data science work is responsible and what isn’t for your organization.  We discussed models for this at a recent Stanford event I was part of.

Innovate with others in your field to create norms.  This stuff is very new, and we are all trying to figure it out together.  Create spaces to meet and discuss your approaches to this with others in your industry.  Innovate together to stay ahead of regulation and legislation.

These four recommendations capture the fundamentals of how I think businesses need to be responding to the push to do data science in responsible ways.
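As one illustration of what the algorithmic QA recommendation above could look like in practice, here is a minimal sketch of a check a QA team might run: compare a model's error rate across groups instead of looking only at the overall number. The records, the group labels, and the `max_gap` threshold are all hypothetical, and this is just one of many possible checks.

```python
from collections import defaultdict

def error_rate_by_group(records):
    """records: iterable of (group, predicted, actual). Returns {group: error rate}."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for group, predicted, actual in records:
        totals[group] += 1
        if predicted != actual:
            errors[group] += 1
    return {group: errors[group] / totals[group] for group in totals}

def has_disparity(rates, max_gap=0.1):
    """True when the best- and worst-served groups differ by more than max_gap."""
    return max(rates.values()) - min(rates.values()) > max_gap

# Hypothetical evaluation results: 10 records per group.
records = (
    [("A", 1, 1)] * 9 + [("A", 1, 0)] * 1 +   # group A: 1 error in 10
    [("B", 1, 1)] * 6 + [("B", 1, 0)] * 4     # group B: 4 errors in 10
)

rates = error_rate_by_group(records)
overall = sum(1 for _, p, a in records if p != a) / len(records)
flagged = has_disparity(rates)

print(f"overall error: {overall:.2f}, by group: {rates}, flagged: {flagged}")
```

Here the overall error rate of 25% hides the fact that one group sees four times as many errors as the other, which is the kind of pitfall the model's creators may not go looking for.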

This post is cross-posted to the civic.mit.edu website.