News organizations have all but abandoned their archives

Sharon Ringel and Angela Woodall have published a comprehensive look at the state of news archiving in the digital age, working under the auspices of the Tow Center at the Columbia Journalism Review. It's an excellent, well-researched report, and it paints an alarming picture of the erosion of news organizations' institutional memory.

Ringel and Woodall find that news organizations are cavalier, even negligent, about archiving their news. They contrast this with the heyday of newspapers, when dedicated librarians staffed a "morgue" of carefully clipped and cross-referenced print articles. Today's news organizations rely primarily on their CMSes, the Internet Archive's Wayback Machine, reporters' personal Google Docs accounts, and social media platforms like Twitter and Facebook to store their articles, social media posts, and other materials.

Although the Internet Archive has done yeoman service in this field, Ringel and Woodall are rightfully skeptical that a single institution should be entrusted with being the sole entity recording our collective history, not least because the Archive only saves pages it discovers in its crawls and cannot traverse paywalls (let alone record alternative headlines, associated social media posts, comments, personalized layouts shown to logged-in users, etc.).

The authors document some nascent archiving tools that news institutions can use to move these functions back in house, and praise the New York Times's efforts to do so, but point out that there's not much movement on this front.

For all that the report covers a lot of ground — it's 17,000 words long! — it also has some omissions about the wider context of the news industry that would help make sense of this state of affairs.

The first of these is market concentration and financialization. Much has been made of the difficulties that news organizations have faced due to changes in (and the eventual monopolization of) the ad market. But long before Craigslist, there was the corporatization of newsrooms: waves of acquisitions by financialized looters who slashed things like archiving and consolidated and reduced other news functions. This is not a relic of the distant past, either: modern newspapers are still being looted at scale by vulture capitalists.

Another key factor that doesn't get explored is the difference between entrusting a platform with a newspaper's data and using platforms as commodity storage depots for portable archives. In the former case, a news organization uses the platform's proprietary tools, which puts it at the mercy of the platform's decision about whether to keep supporting those tools. In the latter case, platforms are used to host portable systems that can be moved to rival systems, or self-hosted, on short notice. The former is a terrible idea; the latter is essential.

Finally, the authors don't give enough time to the potential of emulation to preserve news in its original context. They explore how emulation can allow for the preservation of apps and other exotic and proprietary media, but far more important is the ability to run old OSes with old browsers in order to view news as it was when it was published.

The thing is, digital news has many properties that make it more archivable than print: you can send copies all over the world, all at once. Emulators let modern computers run old software and access old formats. Automated systems can run in the background, eliminating the need for people to perfectly remember to perform archiving steps. There is nothing intrinsic to "digital news" that makes it unarchivable.
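
To make the point about automation and portability concrete, here is a minimal sketch, not anything drawn from the report. It assumes a hypothetical archive laid out as plain files: a `blobs` directory of captured pages named by their SHA-256 hash, plus an append-only `manifest.jsonl` index. The function name, the directory layout, and the record fields are all illustrative assumptions. The idea is only that an archive made of ordinary files can be copied to any host, or self-hosted, without depending on a platform's proprietary tools.

```python
import hashlib
import json
from datetime import datetime, timezone
from pathlib import Path


def archive_snapshot(archive_dir, url, body, captured_at=None):
    """Store one captured page (as bytes) in a plain-file archive.

    Hypothetical layout, assumed for illustration:
      <archive_dir>/blobs/<sha256>   the captured bytes
      <archive_dir>/manifest.jsonl   one JSON record per capture
    """
    root = Path(archive_dir)
    blobs = root / "blobs"
    blobs.mkdir(parents=True, exist_ok=True)

    # Name the file after the hash of its content, so identical captures
    # are stored once and any copy can be verified after it is moved.
    digest = hashlib.sha256(body).hexdigest()
    blob_path = blobs / digest
    if not blob_path.exists():
        blob_path.write_bytes(body)

    captured_at = captured_at or datetime.now(timezone.utc)
    record = {
        "url": url,
        "sha256": digest,
        "captured_at": captured_at.isoformat(),
        "bytes": len(body),
    }

    # Append-only index: every capture is logged, even if the content
    # was already stored under the same hash.
    with open(root / "manifest.jsonl", "a", encoding="utf-8") as manifest:
        manifest.write(json.dumps(record) + "\n")
    return record
```

A publishing system could call something like `archive_snapshot("news-archive", story_url, rendered_html)` every time a story is published or updated, so that nobody has to remember to do it. A real archive would also need to capture images, scripts, and the other context discussed above.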

All that said, this is an excellent report, and sounds an alarm about the ability of our descendants to make sense of our age.

What we found was that the majority of news outlets had not given any thought to even basic strategies for preserving their digital content, and not one was properly saving a holistic record of what it produces. Of the 21 news organizations in our study, 19 were not taking any protective steps at all to archive their web output. The remaining two lacked formal strategies to ensure that their current practices have the kind of longevity to outlast changes in technology.

Meanwhile, interviewees frequently (and mistakenly) equated digital backup and storage in Google Docs or content management systems as synonymous with archiving. (They are not the same; backup refers to making copies for data recovery in case of damage or loss, while archiving refers to long-term preservation, ensuring that records will still be available even as formatting and distribution technologies change in the future.)

Instead, news organizations have handed over their responsibilities as public stewards to third-party organizations such as the Internet Archive, Google, Ancestry, and ProQuest, which store and distribute copies of news content on remote servers. As such, the news cycle now includes reliance on proprietary organizations with increasing control over the public record. The Internet Archive aside, the larger issue is that these companies' incentives are neither journalistic nor archival, and may conflict with both.

A Public Record at Risk: The Dire State of News Archiving in the Digital Age [Sharon Ringel and Angela Woodall/Columbia Journalism Review]

(via Beyond the Beyond)