Making Tools More Learner-Friendly

I often advise learners to be careful with what tools they choose to spend time learning.  Some powerful ones have steep learning curves, full of jargon and technical hurdles.  Others are simple and self-explanatory, but can’t do more than one thing.  I’ve been trying to find better ways to connect with tool builders and talk to them about how they need to build learner-centered tools.

Catherine D’Ignazio and I put these thoughts together into a talk for OpenVisConf this year.  This is a super-dorky conference for data viz professionals… just the place to find more tool builders to talk to!  We put together an argument that data visualization tools can serve as informal learning spaces.  Watch the video below:

New DataBasic Tool Lets You “Connect the Dots” in Data

Catherine and I have launched a new DataBasic tool and activity, Connect the Dots, aimed at helping students and educators see how their data is connected with a visual network diagram.

By showing the relationships between things, networks are useful for finding answers that aren’t readily apparent through spreadsheet data alone. To that end, we’ve built Connect the Dots to help teach how analyzing the connections between the “dots” in data is a fundamentally different approach to understanding it.

The new tool gives users a network diagram to reveal links as well as a high level report about what the network looks like. Using network analysis helped Google revolutionize search technology and was used by journalists who investigated the connections between people and banks during the Panama Papers Leak.
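The kind of analysis a tool like this automates can be sketched in a few lines of Python. This is an illustrative sketch, not Connect the Dots itself, and the edge list is made up:

```python
from collections import defaultdict

# Hypothetical edge list: each pair is a connection between two "dots".
edges = [("Ana", "Ben"), ("Ana", "Cal"), ("Ana", "Dee"), ("Ben", "Cal")]

# Build an adjacency list from the edges.
neighbors = defaultdict(set)
for a, b in edges:
    neighbors[a].add(b)
    neighbors[b].add(a)

# Degree: nodes with many connections are central to the network.
degree = {node: len(links) for node, links in neighbors.items()}
most_connected = max(degree, key=degree.get)
print(most_connected, degree[most_connected])  # Ana has the most connections
```

Even this tiny computation answers a question a spreadsheet view hides: who sits at the center of the network?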

Connect the Dots is the fourth and most recent addition to DataBasic, a growing suite of easy-to-use web tools launched last year and designed to make data analysis and storytelling more accessible to a general, non-technical audience.

As with the previous three tools released in the DataBasic suite, Connect the Dots was designed so that lessons around it can be easily planned to help students learn how to use data to tell a story. Connect the Dots comes with a learning guide and introductory video made for classes and workshops with participants from middle school through higher education. The learning guide has a 45-minute activity that walks people through an exercise in naming their favorite local restaurants and seeking patterns in the networks that result. To get people started, the tool includes sample data sets – such as Donald Trump’s inside connections and characters from the play Les Miserables – that introduce users to vocabulary terms and the algorithms at work behind the scenes. Like the other DataBasic tools, Connect the Dots is available in English, Portuguese, and Spanish.

Learn more about Connect the Dots and all the DataBasic tools here.

Have you used DataBasic tools in your classroom, organization, or personal projects? If so, we’d love to hear your story! Write to help@databasic.io and tell us about your experience.

Thoughts on “Big Data” & “Small Data”

I’ve seen a lot of writing lately on Big Data vs. Small Data.  I know this is something I should pay attention to, because they are capitalizing words that you usually don’t capitalize! Here are some still-forming thoughts…

Rufus Pollock, Director of the Open Knowledge Foundation, recently wrote on Al Jazeera that:

Size doesn’t matter.  What matters is having the data, of whatever size, that helps us solve a problem or addresses the question we have – and for many problems and questions, Small Data is enough.

He argues that Small Data is about the enabling potential of the laptop computer, combined with the communicative ability unleashed by the internet. I was sparked by his post, and others, to jot down some of my own thoughts about these newly capitalized things.

How do I Define Big Data?

Big Data is getting loads of press.  Supporters focus on the idea that ginormous sets of data reveal hidden patterns and truths otherwise impossible to see.  Many critics respond that these projects miss inherent biases, ignore ethical considerations, and remind us that the data never holds absolute truths.  In any case, data literacy is on people’s minds, and getting funding.

My working definition of Big Data focuses more on the “how” of it all.  For one, most Big Data projects run on implicit, unknown, or purposely hidden data collection.  Cell phone providers don’t exactly advertise that they are tracking everywhere you go.  Another aspect of the “how” of Big Data is that the datasets are large enough that they require computer-assisted analysis.  You can’t sit down and draw raw Big Data on a piece of paper on a wall.  You have to use tools that perform algorithmic computations on the raw data for you.  And what do people use these tools for?  They try to describe what is going on, and they try to predict what might happen next.

So What Does Small Data Mean to Me?

Small Data is the new term many are using to argue against Big Data – as such it has a malleable definition based on each person’s goal!  For me, Small Data is the thing that community groups have always used to do their work better in a few ways:

  1. Evaluate: Groups use Small Data to evaluate programs so they can improve them
  2. Communicate: Groups use Small Data to communicate about their programs and topics with the public and the communities they serve
  3. Advocate: Groups use Small Data to make evidence-based arguments to those in power

The “how” of Small Data is very different from the ideas I laid out for Big Data.  Small Data runs on explicitly collected data – the data is collected in the open, with notice, and on purpose.  Small Data can be analyzed by interested laypeople.  Small Data doesn’t depend on technology-assisted analysis, but can engage it as appropriate.

So What?

Do my definitions present a useful distinction?  I imagine that is what you’re thinking right now.  Well, for me the primary difference is around the activities I can do to empower people to play with data.  My workshops and projects focus on finding stories, and telling stories, with data.  With Small Data, I have techniques for doing both.  With Big Data, I don’t have good hands-on activities for understanding how to find stories.

I connect this primarily to the fact that Big Data relies on algorithmic investigations, and I haven’t thought about how to get around that.  Algorithms aren’t hands-on.  You can do engaging activities to understand how they work, but not to actually do them.  In addition –  most of the community groups, organizations, and local governments I work with don’t have Big Data problems.

Put those two things together and you’ll see why I don’t focus on Big Data in my work. Philosophically, I want to empower people to use information to make the change they want, and right now that means using Small Data.  That’s my current thought, and guides my current focus.

Finding Data Stories

Many people have written about techniques for telling data-driven stories (1).  However, I’m struggling to find a similar list of techniques to help people find stories in their data.  To do that you need to have a sense of what kinds of data stories can be told. Here’s my current take on a few categories of data stories (expanding on earlier thoughts I had written about).  I use this list to help community groups find stories in their data that they want to tell.  Each includes a real example based on data scraped from the Somerville tree audit (the town I live in). All of these techniques benefit from existing statistical techniques that can be used to back up the patterns they illustrate.  You can find stories of factoids, connections, comparisons, changes over time, and personal connections in your data.

Factoid Stories

There’s only one Eastern Redbud tree in all of Somerville! What’s the story of that tree?  Turns out the leaves change to bright pink in fall, but everything else is yellow and orange.

An Eastern Redbud tree (from Wikipedia – not the actual tree in Somerville)

Sometimes in large sets of data you find the most interesting thing is the story of one particular point.  This could be an “outlier” (a data point not like the others) like the Redbud example above, or it could be the data point that is most common (can we tap more of the Maple trees that dominate Somerville?).  Going in depth on one particular piece of your data can be a type of data story that fascinates and surprises people.

Connection Stories

How come Somerville Ave has so many trees in the best condition? Oh, it was recently renovated… that is why those are all new trees.  There’s a story there about the aesthetic outcomes of big street resurfacing projects.

A map of Somerville with healthy trees in green (created in TableauPublic)

When two aspects of your data seem related, you can tell a story about their connection.  The fancy name for this is “correlation”, and you of course need to be careful attributing causes for the connection.  That said, finding a connection between two aspects of your data can lead to a good story that connects things people otherwise don’t think about together.
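One quick way to check whether two aspects of your data move together is to compute a correlation coefficient. Here’s a minimal sketch in Python; the numbers are made up for illustration, not from the Somerville tree audit:

```python
import math

# Hypothetical data: years since a street was renovated, and the share
# of its trees rated in good condition (made-up numbers for illustration).
years_since_renovation = [1, 2, 5, 8, 12]
share_good_trees = [0.95, 0.90, 0.75, 0.60, 0.40]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)

r = pearson(years_since_renovation, share_good_trees)
print(round(r, 2))  # strongly negative: older streets, fewer healthy trees
```

A strong coefficient like this is a clue to a connection story, not proof of a cause – the renovation and the tree health could both be driven by something else.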

Comparison Stories

Walking down Somerville Ave. gives you a good sense of the most populous trees across the city.  That street is a good representative of the tree population in the city as a whole.  Is your street different?

Comparison of tree populations in the city and along one street (large bubbles mean more trees)

Comparing between sections of your data can be a good way to find an illustrative story to tell.  Often one part of your data tells one story, but another part tells a totally different story. Or, as in this example above, maybe there is a more human slice of your data that serves as an exemplar of an overall pattern.

Stories of Change

Turns out there was a big die-off of trees in 2008.  Was the climate weird that year? (I made this up since I don’t have any time-based data)

People like thinking about things changing over time.  We experience and think about the world based on how we interact with it over time.  Telling a story about change over time appeals to people’s interest in understanding what caused the change.

“You” Stories

You live on Highland Rd? Did you know that ALL 9 Spruce trees in Somerville are on Highland Rd? Maybe we should rename it “Spruce Rd”?

Map of spruce trees on Highland Rd, colored by tree health (created in TableauPublic)

Another way to find a story in data is to think about how it relates to your life.  People with map literacy like maps because they can place themselves on them.  This personalization of the story creates a connection to the real-world meaning of the data and can be a powerful type of story for small audiences.  Stories about your personal experiences can be grounding and real.

In Conclusion…

This is just one take on the type of data stories that can be told.  Please let me know how you think about this! Telling that story effectively is a whole different topic, but I find the story finding exercise much easier when I introduce a bunch of categories like this.  Most of these benefit from multiple sets of data, so remember to go data “shopping” during your story finding process.

Footnotes:

(1) For instance, I’m a huge fan of Segel and Heer’s Narrative Visualization paper, where they give a catalog of visual storytelling techniques.  Also good is Marije Rooze’s thesis work (particularly the tagged gallery of visualizations from the Guardian and New York Times).

Tools for Data Scraping and Visualization

Over the last few weeks I co-taught a short-course on data scraping and data presentation.  It was a pleasure to get a chance to teach with Ethan Zuckerman (my boss) and interact with the creative group of students! You can peruse the syllabus outline if you like.

In my Data Therapy work I don’t usually introduce tools, because there are loads of YouTube tutorials and written tutorials.  However, while co-teaching a short-course for incoming students in the Comparative Media Studies program here at MIT, I led two short “lab” sessions on tools for data scraping, interrogation, and visualization.

There are a myriad of tools that support these efforts, so I was forced to pick just a handful to introduce to these students.  I wanted to share the short lists of tools I chose to share.

Data Scraping:

As much as possible, avoid writing code!  Many of these tools can help you avoid writing software to do the scraping.  There are constantly new tools being built, but I recommend these:

  • Copy/Paste: Never forget the awesome power of copy/paste! There are many times when an hour of copying and pasting will be faster than learning any sort of new tool!
  • Import.io: Still nascent, but this is a radical re-thinking of how you scrape.  Point and click to train their scraper.  It’s very early, and buggy, but on many simple webpages it works well!
  • Regular Expressions: Install a text editor like Sublime Text and you get the power of regular expressions (which I call “Super Find and Replace”).  It lets you define a pattern and find it in any large document.  Sure the pattern definition is cryptic, but learning it is totally worth it (here’s an online playground).
  • jQuery in the browser: Install the bookmarklet, and you can add the jQuery JavaScript library to any webpage you are viewing.  From there you can use a basic understanding of JavaScript and the JavaScript console (in most browsers) to pull parts of a webpage into an array.
  • ScraperWiki: There are a few things this makes really easy – getting recent tweets, getting twitter followers, and a few others.  Otherwise this is a good engine for software coding.
  • Software Development: If you are a coder, and the website you need to scrape has javascript and logins and such, then you might need to go this route (ugh).  If so, here’s a functioning example of a scraper built in Python (with Beautiful Soup and Mechanize).  I would use Watir if you want to do this in Ruby.
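To give a rough sense of what the regular-expression approach looks like outside a text editor, here is the same idea in Python’s `re` module. The page snippet is made up for illustration:

```python
import re

# Made-up snippet of HTML you might have copied out of a webpage.
snippet = """
<li>Oak - 124 Elm St</li>
<li>Maple - 7 Highland Rd</li>
<li>Spruce - 9 Highland Rd</li>
"""

# Pattern: capture the species name and the street address from each line.
# (\w+) grabs the word before the dash; (.+?) lazily grabs the address.
pattern = re.compile(r"<li>(\w+) - (.+?)</li>")
rows = pattern.findall(snippet)
print(rows)  # [('Oak', '124 Elm St'), ('Maple', '7 Highland Rd'), ('Spruce', '9 Highland Rd')]
```

The same pattern pasted into Sublime Text’s find dialog would highlight the same matches – “Super Find and Replace,” exactly as described above.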

Data Interrogation and Visualization:

There are even more tools that help you here.  I picked a handful of single-purpose tools, and some generic ones to share.

  • Tabula: There are a few PDF-cleaning tools, but this one has worked particularly well for me.  If your data is in a PDF, and selectable, then I recommend this! (disclosure: the Knight Foundation funds much of my paycheck, and contributed to Tabula’s development as well)
  • OpenRefine: This data cleaning tool lets you do things like cluster rows in your data that are spelled similarly, look for correlations at a high level, and more!  The School of Data has written well about this – read their OpenRefine handbook.
  • Wordle: As maligned as word clouds have been, I still believe in their role as a proxy for deep text analysis.  They give a nice visual representation of how frequently words appear in quotes, writing, etc.
  • Quartz ChartBuilder: If you need to make clean and simple charts, this is the tool for you. Much nicer than the output of Excel.
  • TimelineJS: Need an online timeline?  This is an awesome tool. Disclosure: another Knight-funded project.
  • Google Fusion Tables: This tool has empowered loads of folks to create maps online.  I’m not a big user, but lots of folks recommend it to me.
  • TileMill: Google maps isn’t the only way to make a map.  TileMill lets you create beautiful interactive maps that fit your needs. Disclosure: another Knight-funded project.
  • Tableau Public: Tableau is a much nicer way to explore your data than Excel pivot tables.  You can drag and drop columns onto a grid and it suggests visualizations that might be revealing in your attempts to find stories.
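The clustering idea behind OpenRefine is worth a quick sketch.  Its simplest method, the “fingerprint,” normalizes each value and groups values that share a fingerprint.  This is an illustrative re-implementation of that idea, not OpenRefine’s actual code, and the survey answers are hypothetical:

```python
import string
from collections import defaultdict

def fingerprint(value):
    """Normalize a value roughly the way OpenRefine's fingerprint keyer does:
    lowercase, strip punctuation, then sort and de-duplicate the tokens."""
    cleaned = value.lower().translate(str.maketrans("", "", string.punctuation))
    return " ".join(sorted(set(cleaned.split())))

# Messy, hypothetical survey answers that all mean the same place.
values = ["Somerville, MA", "MA Somerville", "somerville ma", "Boston, MA"]

# Group values by their fingerprint.
clusters = defaultdict(list)
for v in values:
    clusters[fingerprint(v)].append(v)

# Groups with more than one spelling are candidates for merging.
merge_candidates = [group for group in clusters.values() if len(group) > 1]
print(merge_candidates)  # [['Somerville, MA', 'MA Somerville', 'somerville ma']]
```

Seeing the three spellings collapse into one cluster makes it clear why this kind of cleaning matters before you start counting anything.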

I hope those are helpful in your data scraping and story-finding adventures!

Curious for More Tools?

Keep your eye on the School of Data and Tactical Technology Collective.

Helping a Community Find Stories in Their Data

My Data Mural work has led me into a new area – actually helping community groups find the stories they want to tell in their raw data.  Until now, all my data therapy work has focused on how to present the data-driven stories more creatively.  This post shares some of the techniques I’m trying out.

Step 1: Speak like a normal person

I know, it should be obvious, but too often when entering the realm of data-anything, we fall back into using big words.  That doesn’t fly when working with community groups that don’t have a shared meaning for those words. I tried to figure out how to use regular words to talk about the types of stories that you can look for.  I came up with this set to start with:


  • comparison: you see two pieces of data that are really interesting when compared to each other
  • factoid: you see one fact that jumps out at you as particularly interesting or startling
  • connection: you see a connection between two pieces of info – you can’t say one causes another, but they’re interesting when put together
  • personal: you have a compelling story or picture that is about one person
  • change: you see one of your measures changing over time

I used regular words to describe the types of data stories in order to make the activity less intimidating to non-data people. Many people nodded their heads as I described these categories (especially at the second workshop where I spoke about them better!).  I was inspired by the Data Stories section of the Data Journalism Handbook.

Step 2: Try it out together first

To come up with a shared definition of what these types of stories meant, I showed a few data points from an amusing data set – the Somerville “Happiness Survey” (raw data).


We quickly tried to find stories of each type in this tiny data set.  Practicing all together on a tiny dataset can create a shared language for finding stories in data. In the breakouts that followed this activity, I could hear people using some of these words with each other to talk about the data they were looking at.

Step 3: Use less data

Usually data analysis starts with a giant set of documents.  This model doesn’t really work for a small community group made up of people that aren’t data nerds.  For our “story-finding” workshops we culled down the full data they gave us, producing a 4-page data handout for people.  Limiting the data helped the community group not be overwhelmed by the task of finding a story they wanted to tell. We definitely made some “editorial” decisions that limited the stories they could find, but we did this with the help of a smaller group of our community partners so it wasn’t arbitrary.

So how did it go?

We scaffolded the story-finding around the idea of telling a story in our “The data say____” format.  This gave us a common way to talk about the stories with each other.  Just as importantly, this forced each person to justify why they thought it was a compelling story to tell in mural form.

So did we build the group’s capacity for data analysis?  Our pre-post survey did NOT show a noticeable increase in people’s self-assessed ease of finding stories in data. Damn. But wait… the answer is probably more nuanced than that.  They did say they came away with more knowledge about the topic the data was about.  They also said one of the most interesting things they learned was “telling data stories”, and in each of these two pilots they came out with a data-driven story that they wanted to tell.

Is exposure to data story-finding  a sufficient outcome?  Am I trying to do too much capacity building all at once?  I’m still pondering how to do this better, so please suggest any tips!

Curious about these pilots?  You can read some more on my collaborator Emily’s Connection Lab blog:

Cross-posted to the MIT Center for Civic Media blog.