Jun 05, 2011

Digital preservation used to be the affair of a few geeky keepers who recognized the value of lonely, obscure data.  But as information technology has spread across our culture, we are developing an intense, long-term relationship with digital content.

"I Love Data" She Wept, by bixentro, on Flickr

Cyberspace When You’re Dead is a good example.  “Suppose that just after you finish reading this article,” begins The New York Times Magazine article, “you keel over, dead.  …what happens to [the] version of you that you’ve built with bits? Who will have access to which parts of it, and for how long?”

As ledes go, that’s sexy as hell.  It nimbly couples our mortality with our digital legacy.  Both are highly personal and endlessly fascinating, and both elude easy answers.

Digital legacy refers to things like your Facebook page and Twitter account, as well as the collective cultural mass on (and off) the internet.  It’s digital photographs, health records, government data and every other kind of documentation that you can think of.  The legacy keeps growing because it serves a host of compelling personal and community purposes. Yet as our digital commitment deepens, so do questions about the relationship.  Lots of average people now worry about things that used to give only archivists and librarians pangs: What pieces of the legacy should be kept?  How do we do it?  Who gets to look at it?

Angst is boiling up all over the place.  How Important Is It To Preserve Our Digital Heritage? recently asked Techdirt.  The story details the grassroots labor of love to preserve the content of Google Video, now that the Googleplex has decided to get out of that business, and similar efforts to rescue content from Friendster and GeoCities, two other defunct sites.

self-portrait: a house is not a home (2), by Marie-II, on Flickr

The people involved in these efforts are passionate amateurs–their collective nom de web is “the archive team”–who donate their time because they believe it’s the right thing to do.  But passion only takes one so far.  The article lists some of the many issues that remain in the relationship between the team and their rescued content, such as how to deploy the right technology and how to deal with obsolete software and file formats.  Techdirt also asks a reasonable question: if the relationship is worth saving, why not seek professional help: “should we have, maybe even one on each continent or in each country, a modern Library of Alexandria?”

Like other issues associated with our digital preservation engagement, this question evades a simple answer.  And I’m not even talking about the fact that much of the current thought in library and archival circles is that digital preservation is best approached in a distributed manner based on collaboration among many institutions.  As the comments posted on Techdirt indicate, the big concern is trust.  Many people worry that government–the presumed benefactor of “a modern Library of Alexandria”–may not be an honest broker in terms of what is selected and how it is kept.

“I really don’t care how much is preserved as long as it’s done by private organizations as opposed to government mandate,” proclaims one commenter.  Another commenter states that “A third party might have a mandate to preserve as much as possible, regardless of PoV or source, whereas a government entity might be tempted to archive predominantly artifacts showing them in a favourable or neutral light.”  As of this writing there are no comments about fears of government using preserved information to violate personal rights, but that concern ripples across the minds of many people as well.

I feel safe making two predictions about the pas de deux between us and our digital legacy.  First, public attraction and attention to digital preservation will continue to expand, along with the number of gigabytes we keep–and that are kept about us.

Second, successfully coping with the issues attendant to the relationship between people and data will turn on communication and trust: we need additional authorities to help plot the way forward.  Personally, I would like to see a new high-profile effort, adequately supported with public and private funds, take this on.  It would be just the ticket to strengthen a bond of faith between us and our digital content.

Jan 18, 2011

The risk of computer hard disk failure is fairly well known: the disk crashes and your computer stops working.

Corrupt 6, by gusset, on Flickr

Less well-known is a phenomenon known as silent data corruption, where an undetected error occurs in content stored on a drive.  Errors creep in from bugs in both software and hardware (firmware).  “Silent” means the drive does not report it, even in situations where special precautions such as multiple redundant disks (RAID) are used.  The problem remains unknown until you attempt to retrieve the data.

An Analysis of Data Corruption in the Storage Stack, a 2008 study of 1.53 million disk drives over a period of 41 months, found 400,000 silent errors.  That is a disturbingly high number, even though the overall percentage of bad to good data was very small.  The bad news for personal users is that the type of hard disk they are most likely to have–a SATA drive–has a silent error rate an order of magnitude larger than that of more expensive “enterprise class” drives.

If you have important digital data that you need to keep for a long time, the best thing to do is to keep multiple copies in different places, stored on different kinds of media–even different kinds of hard disks.  As I wrote earlier, all varieties of digital storage media will eventually fail, so it is essential to have replicated copies.

The more complicated issue is how to detect and fix silent errors.  The most common method is to use a checksum–a unique numerical code computed for each file–to find bad data.  Once an error is found, the next step is to “scrub” it, a process where the bad data is replaced by good data from a trusted source.
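To make the idea concrete, here is a minimal sketch in Python of checksum-based detection and scrubbing.  The file paths and function names are my own illustration, not part of any particular tool; real storage systems do this at much larger scale and lower in the stack.

```python
import hashlib
import shutil
from pathlib import Path

def checksum(path: Path) -> str:
    """Compute the SHA-256 checksum of a file, reading it in chunks
    so even very large files fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def scrub(copy: Path, trusted: Path) -> bool:
    """Compare a copy against a trusted replica; if the checksums differ,
    'scrub' the copy by overwriting it with the known-good data.
    Returns True if a repair was made, False if the copy was clean."""
    if checksum(copy) == checksum(trusted):
        return False  # no silent corruption detected
    shutil.copy2(trusted, copy)  # replace bad data with good data
    return True
```

In practice you would store the checksums separately at the time the files are first written, and re-verify against them on a schedule, so that corruption in one replica can be caught before the same bits rot in another.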

Individual users have limited choices in this regard, unfortunately.  Checksum comparisons and scrubbing require advanced knowledge and can take a long time.  Commercial data recovery services might be able to help, but they charge a hefty premium.  A cloud storage provider may–or may not–provide the service; if it does, and if the service runs automatically in the background, that is a fact very much in its favor.

Jan 03, 2011

Ars Technica is one of the best sources anywhere for insight into technology and its ever-expanding impact.  I was especially pleased that the site ran 10 separate articles about digital preservation during the past year.

3-D rendering of a graphene hole by LBNL, on Flickr

Special credit goes to John Timmer, “Science Editor et Observatory moderator.”  He wrote six excellent pieces on the challenge of preserving and providing meaningful access to scientific data.  He treats the issue superbly, bringing it to life using his real-life experience as a genetics and biology laboratory researcher.

Timmer put together a three-part series on scientific data preservation.  Part I: Preserving science: what to do with raw research material? refers to the recent fuss about the UK’s Climatic Research Unit, particularly its messy data management: “Poorly commented computer code. Data scattered among files with difficult-to-fathom formats…  But the chaos, confused record keeping, and data that’s gone missing-in-action sounded unfortunately familiar to many researchers, who could often supply an anecdote that started with the phrase ‘if you think that’s bad…’”

In Part II: Preserving science: what data do we keep? What do we discard?, he tackles one of the most sensitive—and vexing—issues out there.  “The reality is that we simply can’t save everything. And, as a result, scientists have to fall back on judgment calls, both professional and otherwise, in determining what to keep and how to keep it.”

The inescapable matter of digital media obsolescence is considered in Part III: Jaz drives, spiral notebooks, and SCSI: how we lose scientific data.  “Over the course of my research career, archiving involved magneto-optical disks, a flirtation with Zip and Jaz drives (which ended when some data was lost by said drives), a return to big magneto-optical disks, and then a shift to CDs and DVDs. Interfaces also went from SCSI to Firewire to USB. Anything that wasn’t carefully moved forward to the new formats was simply left behind.”

Wired UK – NDNAD Infographic by blprnt_van, on Flickr

Timmer also weighed in on Changing software, hardware a nightmare for tracking scientific data.  “My work relied on desktop software packages that were discontinued, along with plenty of incompatible file formats. The key message is that, for even careful researchers, forces beyond their control can eliminate any chance of reproducing computerized analyses, sometimes within a matter of months.”

How science funding is putting scientific data at risk highlighted the stark reality that adequate money is all too frequently not provided to maintain important data.   Keeping computers from ending science’s reproducibility explores a huge barrier that gets in the way of confirming research results.  “Traditional science involves a complex pipeline of software tools; reproducing it will require version control for both software and data, along with careful documentation of the precise parameters used at every step.”  But “this work may run up against the issues of data preservation, as older information may reside on media that’s no longer supported or in file formats that are difficult to read.”

Doom Install Disks by Matt Schilder, on Flickr

Ars ran two articles about preserving video games. The first, Preserving games comes with legal, technical problems, referred to a paper in the International Journal of Digital Curation, Keeping the Game Alive: Evaluating Strategies for the Preservation of Console Video Games.  “Hardware becomes outdated and the media that houses game code becomes obsolete, not to mention the legal issues with emulation.”

The second, Saving “virtual worlds” from extinction, discussed Preserving Virtual Worlds, a project at the University of Illinois at Urbana-Champaign.

The final two articles focused on Library of Congress actions (full disclosure: I work with the Library digital preservation team).  Why the Library of Congress cares about archiving our tweets delved into the huge interest that flowed from the Library’s announcement about acquiring the Twitter archives.  Historic audio at risk, thanks to bad copyright laws discussed a report from the National Recording Preservation Board about problems preserving the complex digital formats that underlie much of today’s music.

Let’s hope that Ars continues its coverage of digital preservation into 2011.  There is quite a bit to talk about.