Mar 14 2011

Yesterday’s New York Times had an Op-Ed article about a problem I’ll bet you never thought of: The Digital Pileup.

The essence of the article, with apologies to Jimmy McMillan, is that the amount of digital information is too damn high.  Too much energy to run all those server farms.  Too much human cost “wading through digital detritus.”  Too much money going to all those damn lawyers demanding electronic discovery.

Day 66, by Marquette La, on Flickr

As someone who worries about digital preservation (that is, trying to keep some digital information accessible into the future), I read this article with conflicted feelings.  On the one hand, there can be no doubt that big chunks of important digital information have disappeared.  Most of us have personal experience with losing stuff from our own personal computers.

Even more significant is that a huge percentage of our cultural knowledge and experience now lives solely in digital form.  Unless care is taken to keep and actively manage this data, we risk losing our collective memory as well as grist for future research and discovery.  So it is unsettling–to say the least–to see digital information painted uniformly as “digital detritus,” and users depicted as data “breeders” and “hoarders.”  This is silly and simplistic.

But deep in some part of my mind I get the strange sense that the author has a point.  Sources of digital information–the web, organizational records, social media, scientific databases–are huge pipes, gushing with superabundant data.  The scale and complexity of this information are well beyond the ability of individuals, and even most individual organizations, to manage.  There is so much data that many librarians and archivists are left feeling overwhelmed and perhaps even disheartened in their efforts to get a handle on preserving what is important.

The traditional model of collecting and preserving books, papers, and just about everything else rests on an assumption of scarcity: humanity has a limited capacity for documenting itself, and the portion worth keeping is much smaller still.  Methods for choosing valuable information are based on well-understood ideas about what users will appreciate and what generally will enrich creativity and learning.

All this is turned on its head in the digital age.  Humanity now has a superabundant means to document itself, and it is, at this point, hard to say with certainty which of this information has ongoing value for research or some other use.  Data mining and the ability to link different kinds of data to learn new information lead one to see potential value in just about everything.  The choice frequently boils down to keeping lots and lots of data or keeping nothing.

It can all seem too much.  So, when the article asks “is there anything we can do?” my natural optimism instinctively perked up–for a split second.

Sadly, there is no silver bullet.  The author doesn’t offer much: “we can demand that our companies… aggressively engage in data reduction strategies” (none of the data I’ve cranked out, thank you) and “we can clean up the stockpiles of dead data that live around us” (I’m not yet ready to trash my 50,000 Gmail messages).

Here is a dead obvious prediction: the amount of digital data will continue to grow at a fantastic rate as the pleasure, benefit and lure of technology deepen.  And, like Rosalind in As You Like It, we will continue to ask–rhetorically, only–“Why then, can one desire too much of a good thing?”

Feb 14 2011

What should we call our future with regard to saving and using digital information?

Billions and Billions Served, by Miss Millificent, on Flickr


I think one common term misses the mark in conveying the true threat to data and in expressing the basic imperative for keeping it.

“Digital dark ages” is a popular term that plays on fear and, by the way, suggests that the forces of history are working against data persistence.  The phrase makes for provocative paper and article titles, true, but it hasn’t generated adequate support.  Not to mention that David Rosenthal makes a compelling argument that “digital dark ages turns out to be a poor analogy for the situation we face today.”

Rosenthal, among others, points to what is in fact a completely different reality: the huge and galloping vastness of digital information.  And data will continue to grow at an incredible rate–The Economist noted last year that “information has gone from scarce to superabundant.” Far from data loss through obsolescence, the big problem is actually too much data.  The Economist notes that “the proliferation of data is making them increasingly inaccessible.”

Science has just issued a Special Online Collection: Dealing with Data (registration required).  The introduction notes that “we have recently passed the point where more data is being collected than we can physically store,” and “even where accessible, much data in many fields is too poorly organized to enable it to be efficiently used.”  There are also references to limited funding for data curation to enable broader use or even just keeping the bits safe.

Seth Godin famously noted that people aren’t more worked up over global warming for two basic reasons.  One is the name: “global” is good and “warming” is good so how can “global warming” be bad?

The second reason is that climate change activists “have been unable [to] tell their story with vivid images about immediate actions, it’s just human nature to avoid the issue.”  People need to have an immediate sense of any problem to focus on fixing it.

Digital preservation faces something similar.  “Digital dark ages” sounds scary at first, but the term flies in the face of the reality we confront.  Given how stressed people say they are about information overload, the prospect of data disappearing may actually sound pretty good.

We need a better way to communicate the need for digital preservation and access. In another context, Joseph Hellerstein has talked about “the industrial revolution of data,” which maybe has some possibilities.  “Data-driven” is another current term tossed around in science and technology.

Any thoughts on this?

Picture added, reformatted and tweaked for style on 2/14/2011, 4:45 pm EST