"Scraping for Journalism: A Guide for Collecting Data," on the ProPublica Nerd Blog, is a useful article about converting streaming or unstructured web data into a form better suited to manipulation. The focus is on gathering and crunching data for investigative reporting, but I believe the tools mentioned are useful for other purposes as well.
The article presents the Dollars for Docs Data Guides, which ProPublica developed during its investigation of the financial ties between drug companies and doctors. According to the site, drug companies posted required data on the web in a form that was difficult to download and analyze, so staff worked with a number of free tools to capture the information and move it into structured form.
Anyone who has wrangled large batches of text or web data understands the challenge: converting a mass of heterogeneous data into something more useful. In the world of archives, libraries, and museums, this can apply to work with metadata and finding aids, as well as to access efforts in general.
The five tools ProPublica used are listed below. Each entry links to detailed instructions on where to get the tool and how to use it, and a number of instructional videos are included.
- Using Google Refine to Clean Messy Data: Google Refine, a downloadable desktop tool, can quickly sort and reconcile the imperfections in real-world data.
- Reading Data from Flash Sites: Use Firefox's Firebug plugin to discover and capture the raw data sent to your browser (a short Ruby sketch follows this list).
- Parsing PDFs: Convert made-for-printer documents into usable spreadsheets with third-party sites or command-line utilities and some Ruby scripting (see the second sketch below).
- Scraping HTML: Write Ruby code to traverse a website and copy the data you need (see the third sketch below).
- Getting Text Out of an Image-only PDF: Use a specialized graphics library to break apart and analyze each piece of a spreadsheet contained in an image file, such as a scanned document (see the final sketch below).
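For the Flash case, the point of Firebug's Net panel is that it reveals the URL the Flash movie quietly requests; once you have that, you can usually fetch the data directly and skip the movie entirely. A minimal Ruby sketch, assuming a hypothetical XML endpoint (the URL and filename here are illustrative, not from the guide):

```ruby
require "open-uri"

# Hypothetical URL discovered via Firebug's Net panel, not
# ProPublica's actual data endpoint.
data_url = "https://example.com/chart_data.xml"

# Fetch the raw XML the Flash movie would have loaded and save it.
File.write("chart_data.xml", URI.open(data_url).read)
```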
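For made-for-printer PDFs, one route consistent with the guide's "command-line utilities and some Ruby scripting" is to dump the PDF to text and then split the columns with a small script. A sketch under stated assumptions: poppler's pdftotext utility is installed, the columns are separated by runs of spaces, and the filenames are placeholders.

```ruby
require "csv"

# Convert the PDF to plain text, preserving the printed layout so
# columns stay aligned (requires poppler's pdftotext utility).
system("pdftotext", "-layout", "report.pdf", "report.txt")

# Split each line on runs of two or more spaces (the usual column
# gap in -layout output) and write the result as CSV.
CSV.open("report.csv", "w") do |csv|
  File.foreach("report.txt") do |line|
    fields = line.strip.split(/\s{2,}/)
    csv << fields unless fields.empty?
  end
end
```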
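For the HTML scraping step, a Ruby approach along the lines the guide describes might pair open-uri with the Nokogiri parsing library; Nokogiri, the URL, and the selectors below are my assumptions, not necessarily what ProPublica used.

```ruby
require "open-uri"
require "nokogiri"
require "csv"

# Placeholder URL; substitute the listing page you actually need.
url = "https://example.com/payments"

# Fetch the page and parse it into a searchable DOM tree.
doc = Nokogiri::HTML(URI.open(url))

# Collect the text of every table cell, row by row.
rows = doc.css("table tr").map do |tr|
  tr.css("td").map { |td| td.text.strip }
end

# Save the non-empty rows as CSV for later cleanup in Google Refine.
CSV.open("payments.csv", "w") do |csv|
  rows.each { |row| csv << row unless row.empty? }
end
```

A real site would need its own selectors and some politeness (rate limiting, caching pages locally), but the traverse-and-copy pattern is the same.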
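Finally, for image-only PDFs, the guide describes using a graphics library to break the scanned spreadsheet apart before recognizing the text. As a simpler stand-in for that workflow (swapping in ImageMagick and the Tesseract OCR engine, both assumptions on my part), each page can be rasterized and run through OCR:

```ruby
# Rasterize each PDF page at high resolution, then OCR it.
# Page count, density, and filenames are illustrative only;
# assumes ImageMagick's convert and tesseract are installed.
3.times do |page|
  png = "page-#{page}.png"
  system("convert", "-density", "300", "scan.pdf[#{page}]", png)
  system("tesseract", png, "page-#{page}") # writes page-#{page}.txt
end
```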