Activity: Analyzing Text

Plain text is data too!  The standard approach of manually coding text for topics is well understood.  However, new tools are making it easy for non-programmers to try their hand at quantitative text analysis.  This activity lets people analyze and present a story about the lyrics of their favorite musicians.  It introduces the idea of quantitative analysis of text, and the concepts of bi-grams and tri-grams (two-word and three-word phrases).  In addition, by using musicians’ lyrics it shows that data analysis can be fun, and funny!

What you need:

  • big pieces of paper
  • crayons
  • computers
  • internet connectivity

How you do it:

Start off by showing some examples of lyrics-based visualizations (see the Rap Research Lab or Spotimap by Javier Arce).

Next introduce the WordCounter website and show Elvis’ most commonly used phrases:

[Image: elvis-words]

The King was into loving things! Show how you can upload a text file, or type in text, and get counts for the most common words, 2-word phrases, and 3-word phrases.
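
For facilitators who want to show what is happening behind the counts, here is a minimal Python sketch of counting words, 2-word phrases, and 3-word phrases. It is not how the WordCounter website is implemented; the function names and the sample line (made up, not a real lyric) are assumptions for illustration only.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and pull out runs of letters and straight apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

def ngram_counts(words, n):
    """Count every run of n consecutive words (n=2 gives bi-grams, n=3 tri-grams)."""
    grams = zip(*(words[i:] for i in range(n)))
    return Counter(" ".join(gram) for gram in grams)

# A made-up line standing in for a file of lyrics.
sample = "love me now love me true love me now and always"
words = tokenize(sample)

print(ngram_counts(words, 1).most_common(3))  # most common single words
print(ngram_counts(words, 2).most_common(3))  # most common 2-word phrases
print(ngram_counts(words, 3).most_common(3))  # most common 3-word phrases
```

In this sample the phrase "love me" is counted three times and "love me now" twice, which is the kind of result the tool reports for a whole file of lyrics.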

Then introduce a lyrics website (like lyrics123.net) or give them a bunch of lyrics you have downloaded already.  Be sure to mention the copyright concerns (that’s why I can’t share a zip of lyrics on here!).  Give them the instructions for the activity:

  1. you’ll have 20 minutes
  2. you’ll work in pairs
  3. pick an artist on the website and collect their lyrics into one file (or grab a pre-downloaded zip file if you are providing them)
  4. upload the raw text into the WordCounter website to count words and phrases
  5. click the arrows to download CSV results
  6. open the CSVs in Excel, Tableau, or another tool to start poking at them for a story
  7. sketch a visual presentation of that story on the big pieces of paper with crayons to share

You’ll want to give a few reminders once they have started:

  • at 5 minutes in, make sure each group has picked an artist to work on
  • at 15 minutes in, make sure they have all started drawing something out on the paper

To wrap things up, give each group one minute to show their paper and present their story! While folks are sharing, you’ll naturally end up with opportunities to talk about a few topics of interest when it comes to text analysis:

  • normalization: if they compared two artists, they need to normalize their comparison for the size of each corpus (for example, uses per 1,000 words rather than raw counts); TF-IDF is a great analysis technique to help figure out if a word is used more often than you’d expect
  • stemming: if they are looking for something like all the uses of “love”, including “loving” and “lover” and such, then stemming would be an appropriate next step
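
If a group wants to go further, the two topics above can be sketched in a few lines of Python. This is a rough illustration under stated assumptions: the word counts are made up, the function names are hypothetical, the TF-IDF shown is one common basic formulation (term frequency times the log of inverse document frequency), and the stemmer is a deliberately crude suffix stripper rather than a real algorithm such as Porter's.

```python
import math
from collections import Counter

def rate_per_thousand(counts, word):
    """Uses of `word` per 1,000 words, so corpora of different sizes compare fairly."""
    total = sum(counts.values())
    return 1000 * counts[word] / total if total else 0.0

def tf_idf(word, doc_counts, all_docs):
    """Share of the document taken up by `word`, times log(N / documents containing it)."""
    total = sum(doc_counts.values())
    docs_with_word = sum(1 for doc in all_docs if doc[word] > 0)
    if total == 0 or docs_with_word == 0:
        return 0.0
    return (doc_counts[word] / total) * math.log(len(all_docs) / docs_with_word)

SUFFIXES = ("ing", "ers", "er", "ed", "es", "s")

def crude_stem(word):
    """Strip one common suffix, then a trailing 'e'. Crude on purpose."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            word = word[: -len(suffix)]
            break
    if word.endswith("e") and len(word) > 3:
        word = word[:-1]
    return word

# Made-up word counts for two hypothetical artists.
artist_a = Counter({"love": 4, "you": 6})
artist_b = Counter({"love": 8, "you": 20, "truck": 12})
corpus = [artist_a, artist_b]

# Artist B says "love" more often in raw counts, but less often per 1,000 words.
print(rate_per_thousand(artist_a, "love"), rate_per_thousand(artist_b, "love"))

# A word both artists use scores zero; a word only one uses scores above zero.
print(tf_idf("love", artist_a, corpus), tf_idf("truck", artist_b, corpus))

# Stemming groups the variants of "love" together, but leaves "glove" separate.
variants = ["love", "loving", "lover", "loves", "loved", "glove"]
print(Counter(crude_stem(w) for w in variants))
```

With only two artists the TF-IDF scores are very coarse; the technique becomes more informative as more artists are added to the comparison.
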
Analysis of who is the most narcissistic (by Stephen Seun & Mary Delaney)

Inspirations:

This activity is inspired by Aleszu Bajak’s write up of Derek Willis’ in-class exercise.  It also pulls the lyrics-as-data from Tahir Hemphill’s Rap Research Lab workshops.
