Note: this post originally appeared on the Code for America Labs blog.

tl;dr

  • Get data any way you can.
  • Use Refine & Recline to clean it up and put it in a couch.
  • Add geospatial support.
  • Build an app on your new API (optional).
  • Share it with the world.

Long version

Last time I wrote about how to build a US-wide farmers market API in under an hour. Today, I want to walk through a larger challenge to highlight some of the tools & techniques we are using at Code for America to create APIs for public data (and an example of a web app that uses the API). For this example we’re going to use the City of Boston’s Public Art website. I really like this site; they have a great map interface and a bunch of good content. The one missing piece is an API so that other developers can build on this data. That’s what we’ll build today (and just to be extra productive, we’ll set up a mobile web app that runs off the data). Also, Boston is a 2011 Code for America city and they are already taking steps to provide an API for this data, so I figure they won’t mind if we do this :)

Before we dive into this specific case, I want to point out some of the tools that all data liberators should be familiar with, along with how we are going to use them.

  • Google Refine - we use it here as a scraper and for initial data clean-up; if you’re not familiar with Refine, follow the link and watch the video.
  • Recline - as a data manager for couch documents.
  • GeoCouch-Utils - adds a geospatial API and a default map.
  • viewer - an example of a drop-in CouchApp… with a poor name.

1. Getting the data

In many cases, this can be the hardest step. Fortunately, the City of Boston has a website with geolocated data for over 250 pieces of public art. We are going to look at the source code for their map and find all the juicy data. If you want to skip straight to the data (or if the website changes formatting) I’ve posted a copy of the file I used on GitHub. Be aware that this method of getting data is not a best practice and should be used only when needed (and legal).

Go to the Public Art Boston map and view the source code for the page. Copy the “markers” array from the javascript source into a file called boston_public_art.json.
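For reference, each entry in the copied “markers” array looks roughly like the following. The field names are inferred from the columns that appear in Refine in the next step; the values are illustrative, not the site’s actual output.

{ // illustrative entry – the real values come from the site
  "text": "popup markup we won't need",
  "rmt": "276/0",            // looks like an ID with a trailing "/0"
  "latitude": "42.3550",
  "longitude": "-71.0650",
  "markername": "marker_276",
  "title": "Example Sculpture"
}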

source code from Public Art Boston

Tidy the data in Refine

Open boston_public_art.json in Refine.

data imported into Refine

Remove the text column - we don’t need it.

remove text column

Clean up the column names (rmt, latitude, longitude, markername, title).

Column headers before:

titles before

Column headers after:

titles after

Transform the rmt column

I’m not sure what rmt stands for, but it looks like an ID to me (except for all the ‘/0’s). Let’s go on that hunch for now and remove all of the ‘/0’s so we are left with just the IDs. We’ll do that using the Transform functionality in Refine. Transforming a cell involves running a small piece of javascript-esque code on each cell in the column. Transforms use the Google Refine Expression Language (GREL) and are a powerful way to programmatically clean poorly formatted data. Our current situation, however, only requires a slight string manipulation using the replace method.
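In the Transform dialog, a one-line GREL expression does the trick (assuming the rmt values look like “276/0”):

value.replace('/0', '')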

refine transform menu

refine transform dialog

Scrape more data

I want to pause for a minute and do a little recon. Let’s use one of our IDs and see where it leads us. We’ll take the first ID (276) and build a standard, out-of-the-box Drupal URL: http://www.publicartboston.com/node/276 (it was pretty trivial to figure out this was a Drupal site). This looks like a page full of awesome data! Taking a look at the source code, we see that the data is in some sort of structured form – good news. Extracting the data from the page looks like a simple job for a javascript library like jQuery.

Screenshot of the structure of the page with all the data

We now find ourselves in a position where we have a list of IDs which lead to webpages containing great public art data. The next question is: how do we mash all this together? Luckily, Refine has a function to help us out. It’s called “Add column by fetching URLs…”. In many cases of data clean-up, this function is used to geocode addresses, but today we will be using it to pull the HTML from a webpage into our DB.
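On the rmt column, choose “Add column by fetching URLs…” and build each URL from the ID with a GREL expression like this (the base URL is the Drupal node path we found above):

"http://www.publicartboston.com/node/" + value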

Fetch url on the menu

Fetch url dialog

Fetch url result

Great, it’s done! Now we have a column called full_html with an entire webpage jammed into it. It may look awful, but it is exactly what we want. At this point, I would say that we’ve successfully acquired the data. Now let’s clean it up and make it more useful.

Note: Google Refine is a powerful tool and we’ve barely scratched the surface of what it can do. I would definitely recommend exploring it more. If you’ve got an old CSV of poorly structured data, dump it into Refine and see how easily you can clean/normalize it.

2. Clean the data

Now that we have our raw data, we can put it into our database. We are going to use CouchDB as our database (and later as our application server). I chose CouchDB to leverage its geo-spatial component (GeoCouch), to use its out-of-the-box REST API, and because its schema-less document structure will handle data of different shapes. Another benefit of using CouchDB is that we can use CouchApps to easily add frontend interfaces to the data (e.g. Recline, GeoCouch-Utils & the Viewer app used below). For development, I would highly recommend using the Couchbase Single Server. Just download and install it. You will know it worked when you can go to http://localhost:5984/_utils and see Futon, CouchDB’s web-based administration console.

Setup the DB

From the Futon homepage, create a new database and call it boston_public_art.

Creating a DB in Couch
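(If you’d rather stay in the terminal, creating a database is a single PUT against CouchDB’s REST API:)

curl -X PUT http://localhost:5984/boston_public_art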

Install Recline

Recline is a cleverly named CouchApp built by Max Ogden. It provides a Refine-ish interface to data in a couch. My favorite part is that transformation functions are written in javascript (as opposed to GREL in Refine). Also, we can access the jQuery object in our transformations, which will come in handy for us. The easiest way to install Recline is via the command line.

curl -X POST http://localhost:5984/_replicate -d '{"source":"http://max.iriscouch.com/apps","target":"boston_public_art", "doc_ids":["_design/recline"]}' -H "Content-type: application/json"

Alternatively, if you are familiar with CouchApps, you can download the code from GitHub and install it yourself. Once installed, go to http://127.0.0.1:5984/boston_public_art/_design/recline/_rewrite/. You will see an interface that resembles a minimalist version of Refine.

Screenshot of a fresh Recline install

Let’s put some data in it.

Upload the data to a Couch

Now that the couch is ready, we need to move our data into it from Refine. We’ll use the Refine export templating engine to create a JSON object that can be accepted by the native CouchDB bulk documents API as illustrated below.

screenshot of refine export templating engine
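As a sketch, the template boils down to wrapping one JSON object per row inside a single “docs” array – the shape the bulk documents API expects. With our current columns, the four template boxes would look something like this (jsonize is GREL’s JSON-escaping helper):

Prefix:        {"docs": [
Row template:  {"rmt": {{jsonize(cells["rmt"].value)}}, "latitude": {{jsonize(cells["latitude"].value)}}, "longitude": {{jsonize(cells["longitude"].value)}}, "markername": {{jsonize(cells["markername"].value)}}, "title": {{jsonize(cells["title"].value)}}, "full_html": {{jsonize(cells["full_html"].value)}}}
Row separator: ,
Suffix:        ]}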

Clicking “Export” will download the JSON object as a file named boston-public-art.txt. We switch to our terminal and navigate to the folder where the file was downloaded (I use ~/Downloads/ on my machine). The following command will insert all of our data into the couch (each object in our “docs” array will become one document in the couch).

curl -X POST http://127.0.0.1:5984/boston_public_art/_bulk_docs -d@boston-public-art.txt -H "Content-type: application/json"

Screenshot of Recline after the import

Note: There is another way to upload the data from Refine to CouchDB. You will need to download and install Max’s Google Refine Uploader Extension. It allows you to export data from Refine directly into a couch – basically automating the steps above.

Screenshot of Refine Uploader

After installing Refine Uploader, the “Upload…” option will appear in the “Export” menu.

Screenshot of Refine Uploader dialog box

Enter the path to your database, followed by “_bulk_docs” (in our case, http://127.0.0.1:5984/boston_public_art/_bulk_docs).

Use jQuery to extract the data

I may be getting ahead of myself. Before we can use jQuery to pull the data out of the scraped page and into the correct fields of our DB, we need to know what pieces of information we are looking for. I know that in order to use the Viewer CouchApp we need to have the following pieces of data:

  • _id – a unique ID for each document. The auto-generated CouchDB IDs will be fine for this.
  • title – the title of the piece of art
  • artist – name of the artist(s)
  • description – description of the work, artist statement, notes, etc.
  • discipline – preferably one of the following: sculpture, painting, photography, ceramics, fiber, architectural integration, mural, fountain, other
  • location_description – a human-readable location for the piece
  • full_address – full street address, with city, state, and zip if possible
  • geometry – longitude/latitude in geojson point format
  • image_urls – an array of URLs to remote images
  • data_source – the source of the data (e.g. ‘Boston Art Commission’)
  • doc_type – this field is used by the Viewer CouchApp and should always be set to “artwork”

All of these bits of data are present in our scraped HTML, as well as a few others. The Viewer CouchApp we are going to set up only requires the fields above, but in the interest of building a richer API, we should include all of the data available. Our additional fields will be: audio_description, collection, funders, medium, neighborhood and year.
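Put together, we’re aiming for documents shaped roughly like this (every value below is made up for illustration):

{
  "_id": "88f0ef7bdb45b1f78909b8b503000bde",
  "title": "Example Sculpture",
  "artist": "Jane Doe",
  "description": "A short artist statement…",
  "discipline": "sculpture",
  "location_description": "Boston Common",
  "full_address": "Boston Common, Boston, MA",
  "geometry": { "type": "Point", "coordinates": [-71.0657, 42.3550] },
  "image_urls": ["http://www.publicartboston.com/files/example.jpg"],
  "data_source": "Boston Art Commission",
  "doc_type": "artwork",
  "audio_description": "http://www.publicartboston.com/files/example.mp3",
  "collection": "City of Boston",
  "funders": "Example Foundation",
  "medium": "Bronze",
  "neighborhood": "Downtown",
  "year": "1877"
}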

Now that we’ve planned out what information we want, it is time to go about collecting it. A simple way to do this is to run a transform function that uses jQuery to pull the data out of our full_html field. Select “Transform” from the menu at the top of any column.

Screenshot of transform link on Recline

This should all look very similar to Refine so you won’t be surprised by the dialog box that opens next. The main difference from Refine is that now we can use any javascript functionality within our transform. The default transform function (for the ‘_id’ column) looks like this:

function(doc) {             // doc is an object representing the document (or row)
  doc['_id'] = doc['_id'];  // Do any transformations you'd like
  return doc;               // Return the transformed document
}

This process makes it easy to add new fields to your documents; just assign a value to doc['newFieldName'] and it will be added to the document. We are going to get crazy and add all of our fields at once. Here is our finished transformation function:

function(doc) {
  doc['artist'] = $(doc['full_html']).find('h3:contains(Artist)').next('div').text();
  doc['description'] = $(doc['full_html']).find('h3:contains(Description:)').filter(':first').next('div').text();
  doc['discipline'] = $(doc['full_html']).find('h3:contains(Type:)').next('div').text();
  doc['location_description'] = $(doc['full_html']).find('span.fn').text();
  doc['full_address'] = $(doc['full_html']).find('span.fn').text() +', Boston, MA';
  doc['geometry'] = { type: "Point", coordinates: [parseFloat(doc['longitude']), parseFloat(doc['latitude'])] }; // geojson wants [lon, lat]
  var temp = [];
  $(doc['full_html']).find('.node_images img').each(function(idx, el){ temp.push(el.src); });
  doc['image_urls'] = temp;
  doc['data_source'] = 'Boston Art Commission';
  doc['doc_type'] = 'artwork';
  doc['audio_description'] = $(doc['full_html']).find('h3:contains(Audio Description:)').next('div').find('a').attr('href');
  doc['collection'] = $(doc['full_html']).find('h3:contains(Collection:)').next('div').text();
  doc['funders'] = $(doc['full_html']).find('h3:contains(Funders:)').next('div').text();
  doc['medium'] = $(doc['full_html']).find('h3:contains(Medium:)').next('div').text();
  doc['neighborhood'] = $(doc['full_html']).find('h3:contains(Neighborhood:)').next('div').text();
  doc['year'] = $(doc['full_html']).find('h3:contains(Year:)').next('div').text();
  return doc;
}

Yes, a bit of refactoring would make this function more efficient, but I want to be clear on where each new field value is coming from. Run the transform. Once it is finished, our data will be clean (almost). One last piece of housekeeping we should do is delete the full_html column. To do that we simply use the “Delete this column” command on the column header.

delete this column screenshot
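As an aside, the refactoring mentioned above mostly amounts to parsing full_html a single time and reusing the result. A sketch (only the artist field shown; the others follow the same pattern):

function(doc) {
  var $page = $(doc['full_html']);  // parse the scraped HTML once
  doc['artist'] = $page.find('h3:contains(Artist)').next('div').text();
  // ...and so on for the remaining fields, reusing $page each time
  return doc;
}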

3. Add geo-spatial support

Earlier, we made sure that each document in our Couch had a geojson-formatted geometry field. In order to utilize this field for geospatial queries, we are going to install another CouchApp, GeoCouch-Utils. Just like the Recline CouchApp, you can either download the source from GitHub and install it via the CouchApp command line interface, or replicate it from someone else’s couch via cURL.

curl -X POST http://localhost:5984/_replicate -d '{"source":"http://max.iriscouch.com/apps","target":"boston_public_art", "doc_ids":["_design/geo"]}' -H "Content-type: application/json"

To make sure that everything went well, point your browser at the map that comes with GeoCouch-Utils which is located at http://127.0.0.1:5984/boston_public_art/_design/geo/_rewrite/.

You should see a bunch of dots on a map of Boston like this:

GeoCouch-utils out of the box map

We can now give the API a bounding box of longitude/latitude coordinates and we will get a list of all the pieces of art within that box. The bbox parameter is ordered west,south,east,north – that is, minimum longitude, minimum latitude, maximum longitude, maximum latitude. For example, if we wanted all of the art in/around Boston Common, we would hit the following URL:

http://localhost:5984/boston_public_art/_design/geo/_spatial/_list/geojson/full?bbox=-71.07114,42.3519,-71.0631,42.3577

At this point we have a geospatially-enabled REST API for all city-sanctioned public art in Boston. Huzzah!
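As a quick sanity check from javascript, something like the following should work when run from a page served by the couch itself (browsers will block the request cross-domain). I’m assuming here that the geojson list function returns a standard GeoJSON FeatureCollection; the exact property names may vary with the list implementation.

// Hypothetical client-side check of the spatial API (jQuery assumed to be loaded)
var url = 'http://localhost:5984/boston_public_art/_design/geo/_spatial/_list/geojson/full' +
          '?bbox=-71.07114,42.3519,-71.0631,42.3577';   // the Boston Common box from above
$.getJSON(url, function(collection) {
  $.each(collection.features, function(i, feature) {
    console.log(feature.properties.title, feature.geometry.coordinates);
  });
});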

4. Set up the mobile app

I am a huge advocate of organizations (cities included) opening up their data via APIs, but I’ve found that many times the decision makers need an example of what can be built on the API. This is why I built a CouchApp with a crappy name: viewer. It is basically a CouchApp version of the MuralApp mobile website that Aaron Ogle & I built for the Open Data Philly Hackathon. Once again, you can install it manually by downloading the code from GitHub and pushing it to your couch with the CouchApp CLI, or you can just enter this command in your terminal:

curl -X POST http://localhost:5984/_replicate -d '{"source":"http://mertonium.iriscouch.com/apps","target":"boston_public_art", "doc_ids":["_design/viewer"]}' -H "Content-type: application/json"

Now that the viewer CouchApp is installed, you should be able to see the app live at http://localhost:5984/boston_public_art/_design/viewer/index.html and it should look something like this:

Screenshot of Public Art mobile website

Hey, look at us, we just put together a mobile website for exploring and discovering public art in Boston! Awesome.

5. Make it public

Developing locally is great, but we need to close the circle and share what we’ve made with the world. Luckily, we can make use of CouchDB’s replication feature to easily move our app & data to a free host.

Two of the big names in free CouchDB hosting are Iriscouch (which I use regularly) and Cloudant.

Once you’ve set up an account, create a database called boston_public_art. Then, all you need to do is replicate your local boston_public_art database to the one you just created, via the replicator in Futon (you could also do it from the command line, as sketched below, but beginners usually find Futon more approachable).

Screenshot of Futon replicator
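For reference, the command-line version is a single call to _replicate (user:pass and YOURNAME are placeholders for your own hosted couch’s credentials and domain):

curl -X POST http://localhost:5984/_replicate -d '{"source":"boston_public_art","target":"http://user:pass@YOURNAME.iriscouch.com/boston_public_art"}' -H "Content-type: application/json"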

After the replication, you should be able to access the mobile app at the same URL as before, but with your hosted domain instead of localhost:5984. For example, I’ve got the app set up at http://mertonium.iriscouch.com/boston_public_art/_design/viewer/index.html (be sure to check it out on a mobile device too).

What’s next?

This is the point where I challenge you to go forth and liberate data with the tools and techniques listed in this post, but I do want to add one key point: data liberation is not a smash & grab process. If the data is not kept up to date, then it will lose its value and all of your hard work will be for naught. To guard against this, work with the people and organizations who hold the data. You may need to scrape data at first, but your end-game should be to come up with a system where updates are pulled into the couch automatically (or help the data-holders build an API directly into their current system).

More information