Distributed Digital Preservation
Multiple copies. Multiple locations.
The basic premise of distributed digital preservation (DDP) is that the risk of digital loss is mitigated by distributing copies of digital files to geographically dispersed locations. A DDP network must preserve, not merely back up, the files in these different locations. All DDP networks therefore share certain characteristics, including that they comprise multiple preservation sites (best practices suggest at least three). The following principles are also recommended, all of which are intended to reduce the chance that any single point of failure could impact the network:
- Sites preserving the same content should not be within a 75- to 125-mile radius of one another
- Preservation sites should be distributed beyond the typical pathways of natural disasters
- Preservation sites should be distributed across different power grids
- Preservation sites should be under the control of different systems administrators
- Content preserved in disparate sites should be kept on live media and checked regularly for bit rot and other issues (a fixity-check sketch follows this list)
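To make that last principle concrete, the minimal Python sketch below recomputes SHA-256 checksums and compares them against a previously recorded manifest, flagging files whose checksums have drifted. The manifest format (one "checksum path" line per file, as in a BagIt manifest) and the file paths are assumptions for illustration; real DDP networks handle this with dedicated software such as LOCKSS.

```python
# Minimal fixity-check sketch (illustrative only, not a network's actual tooling).
# Assumes a manifest of "checksum  path" lines, as in a BagIt manifest-sha256.txt.
import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Compute the SHA-256 checksum of a file, reading in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def check_fixity(manifest: Path, root: Path) -> list[str]:
    """Return the paths whose current checksum no longer matches the manifest."""
    damaged = []
    for line in manifest.read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        if sha256_of(root / name) != expected:
            damaged.append(name)
    return damaged

if __name__ == "__main__":
    for name in check_fixity(Path("manifest-sha256.txt"), Path(".")):
        print(f"bit rot suspected: {name}")
```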
For more information, please see A Guide to Distributed Digital Preservation.
Process
1: Prepare
Member institutions work at their own pace to prepare their content for preservation, producing packages of content according to their local needs and workflows.
For some, this includes rigorous workflows through a variety of digital preservation tools and systems, including Islandora, Archivematica, BitCurator, Fedora, Hydra, and many others.
For other institutions, this means simply packaging content using the “BagIt” specification, which is similar in concept to a “zip” file but does not compress or alter the content it packages, and which produces manifests listing every file in the package along with each file’s checksum.
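As a concrete illustration of what “bagging” looks like, the sketch below uses the Library of Congress’s bagit-python library (pip install bagit); the directory name and metadata values are placeholders, and this is not necessarily the exact workflow any given member uses.

```python
# Sketch of packaging a directory according to the BagIt specification,
# using the Library of Congress bagit-python library (pip install bagit).
# The directory name and metadata values are placeholders.
import bagit

# Converts my_collection/ in place: the payload files move under data/,
# and manifest files recording a checksum for every payload file are written.
bag = bagit.make_bag(
    "my_collection",
    {"Source-Organization": "Example University Library"},
    checksums=["sha256"],
)

# The payload itself is left uncompressed and unaltered.
print("files in payload:", sum(1 for _ in bag.payload_files()))
```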
For all institutions, having a community of practice filled with thought leaders and active digital preservationists is a key benefit of membership—our members learn from each other’s experiments, workflows, challenges, and successes.
2: Ingest
The member institution creates a short entry in the MetaArchive Conspectus interface describing the collection(s) it is submitting for ingest, then submits the package(s) of content it has prepared. These packages are tested in the MetaArchive test network to ensure that they are ready for full ingest.
Once the content is tested and approved, the network administrator selects five secure, closed-access nodes on the network to receive it for preservation. Each of the five systems administrators who manage those nodes ingests the content and records the successful completion of that ingest.
3: Monitor & Repair
Once content is ingested into our secure network, a voting and polling process begins. Via the LOCKSS software’s polling and voting protocol, the five preservation nodes regularly and iteratively check in with each other to make sure that all five copies of the content remain identical over time.
If a mismatch is detected among the nodes, the servers vote to reach quorum on which copies are correct and which are not, and the network then repairs the corrupted copies and records that action, as sketched below.
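The real polling protocol is considerably more sophisticated, but the majority-vote idea behind it can be sketched in a few lines of Python; the node names and checksum values below are invented for illustration.

```python
# Toy illustration of quorum-based repair. Node names and checksums are
# invented; the real LOCKSS polling protocol is far more sophisticated.
from collections import Counter

# The checksum each of the five nodes reports for the same preserved file.
reports = {
    "node-1": "9f2c",
    "node-2": "9f2c",
    "node-3": "9f2c",
    "node-4": "71d0",  # this copy has silently drifted
    "node-5": "9f2c",
}

# Quorum: the checksum held by a majority of nodes is taken as correct.
majority, votes = Counter(reports.values()).most_common(1)[0]
assert votes > len(reports) // 2, "no quorum reached; flag for human review"

for node, checksum in reports.items():
    if checksum != majority:
        # In the real network, the damaged node re-fetches the content from
        # a peer holding the agreed-correct copy and logs the repair.
        print(f"{node} disagrees with quorum; repair from a healthy peer")
```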
4: Restore
If an institution for any reason loses its local copy of content that is preserved in MetaArchive, it simply alerts the network administrator via email and requests a preserved copy. The network administrator retrieves a copy from the network and provides it to the member institution.
Tech Specs
Tools
MetaArchive has produced a set of tools and scripts to assist members in preparing content for ingest into the preservation network, including tools that help with BagIt-based ingests and that validate the quality of files before ingest.
All tools are publicly available on the MetaArchive GitHub repository.
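As one example of the kind of pre-ingest check such tools perform (a sketch, not MetaArchive’s actual scripts), the snippet below confirms that a prepared bag is complete and uncorrupted before it is submitted; the bag path is a placeholder.

```python
# Sketch of a pre-ingest sanity check: confirm that a prepared bag is
# complete and uncorrupted before submission. Not MetaArchive's actual
# tooling; uses the bagit-python library with a placeholder path.
import sys
import bagit

try:
    bag = bagit.Bag("my_collection")
    bag.validate()  # recomputes and compares every recorded checksum
    print("bag is complete and ready for ingest")
except bagit.BagError as err:  # missing files, checksum mismatches, etc.
    print(f"bag failed validation: {err}")
    sys.exit(1)
```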
Hosting
Members who host a server cache purchase, install, and configure a server according to a regularly updated set of Technical Specifications. These specifications provide a set of basic requirements but are designed to be flexible, so that members can implement them in accordance with local institutional practices.
MA vs Cloud Storage
As a PLN, MetaArchive stores member content on infrastructure managed and controlled by the member institutions themselves. Services built on cloud storage ultimately store content on infrastructure controlled and operated by commercial vendors.
Security
The MetaArchive network employs rigorous security mechanisms, including SSL-encrypted communication between server caches and decentralized system-administrator access: only local system administrators can access local servers, which protects the network as a whole from a compromise at any single site.
Private LOCKSS Networks
Within the cultural memory community, many DDP solutions rely upon the LOCKSS software in a Private LOCKSS Network (PLN) framework. A PLN is a closed group of geographically distributed servers (known as “caches” in LOCKSS terminology) that are configured to run the open-source LOCKSS software. The software uses the Internet to connect these caches with each other and with the websites that host the content contributed to the network for preservation.
In PLNs, every cache has the same rights and responsibilities; there is no lead cache equipped with special powers or features. Once a cache is up and running, it can continue to run even if it loses contact with the network’s central configuration server. This peer-to-peer structure is especially robust against failures: if any cache in the network fails, the others can take over, and if a cache is corrupted, any other cache in the network can be used to repair it. Because all caches are alike, the work of maintaining the network is truly distributed among all of the partners. This is one of the great strengths of the distributed preservation approach.
More information on current PLNs can be found on the LOCKSS website.