Replication System

Status: Data in Dryad are replicated across multiple systems to support Failover, improve access times, allow recovery from disk failures, and preserve bit integrity. We are working on other types of replication for greater fault-tolerance.

Requirements

  1. Our primary reasons for replication are guarding against service and data loss and speeding access by providing multiple access points.
  2. Initially, all submissions are to be processed through the primary (master) node at NCSU.
  3. Data deposited in Dryad is to be mirrored on multiple machines within 1 hour.
  4. Both data and metadata are to be mirrored.
  5. Data is to be checked periodically to ensure the copies are consistent, and there must be a mechanism for correcting any inconsistencies that are found.
  6. The replication system must be easy to maintain over time.
  7. It must be easy to add new mirror nodes to the system.
  8. The solution must use a widely-used technology (to minimize the maintenance burden and maximize preservation). If new code is required, it should be code that can and will be used by others in the DuraSpace/repository community.
  9. When a user downloads a data file, the file should be transferred from the closest mirror node.
  10. Content should be synchronized as soon as it enters the Dryad submission process (prior to approval by a curator).
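
As an illustration of requirement 5, the following is a minimal sketch of a consistency check between two copies of a file tree. It is not Dryad's actual mechanism: the function names and the choice of SHA-256 are assumptions made for the example, and it only reports inconsistencies rather than correcting them.

```python
import hashlib
from pathlib import Path


def checksum_tree(root):
    """Map each file path (relative to root) to its SHA-256 hex digest."""
    root = Path(root)
    digests = {}
    for path in sorted(root.rglob("*")):
        if path.is_file():
            h = hashlib.sha256()
            with path.open("rb") as fh:
                # Read in 1 MiB chunks so large data files are not loaded whole.
                for chunk in iter(lambda: fh.read(1 << 20), b""):
                    h.update(chunk)
            digests[path.relative_to(root).as_posix()] = h.hexdigest()
    return digests


def compare_copies(primary, mirror):
    """Report files that are missing from, extra on, or different on the mirror."""
    a = checksum_tree(primary)
    b = checksum_tree(mirror)
    return {
        "missing": sorted(set(a) - set(b)),
        "extra": sorted(set(b) - set(a)),
        "mismatched": sorted(k for k in set(a) & set(b) if a[k] != b[k]),
    }
```

A periodic job could call compare_copies on the primary and each mirror and re-copy any file listed under "missing" or "mismatched".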


Methods Being Used or Developed

File-level synchronization

Status: Dryad currently uses file-level synchronization to back up content from the production system at NCSU to the backup system at Duke. Data files are copied using the Unix rsync utility. For details, see WG:Server Setup.

Pros: Simple, widely used.

Cons: DSpace does not make this type of synchronization easy, since data files and their metadata are stored separately. The data files and metadata can be slightly out of sync if the sync process runs while content is being submitted. It also requires synchronizing the entire database at once, so it is difficult to migrate content on the fly.
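
The rsync-based copy described above could be driven by a small wrapper such as the following sketch. The paths and host name in the usage note are hypothetical, and the flag selection (--archive, --delete, --checksum) is an assumption for illustration, not the configuration documented in WG:Server Setup.

```python
import shlex
import subprocess


def build_rsync_command(source_dir, dest, dry_run=True):
    """Build an rsync command that mirrors the contents of source_dir onto dest."""
    cmd = ["rsync", "--archive", "--delete", "--checksum"]
    if dry_run:
        # Only report what would change; nothing is copied or deleted.
        cmd.append("--dry-run")
    # A trailing slash makes rsync copy the directory's contents,
    # not the directory itself.
    cmd += [source_dir.rstrip("/") + "/", dest]
    return cmd


def mirror(source_dir, dest, dry_run=True):
    """Run the rsync command and raise if rsync exits with an error."""
    cmd = build_rsync_command(source_dir, dest, dry_run)
    print("running:", shlex.join(cmd))
    subprocess.run(cmd, check=True)
```

For example, mirror("/opt/dryad/assetstore", "backup.example.org:/opt/dryad/assetstore") would perform a dry run against a hypothetical backup host. Note that this copies data files only; the metadata in the database still has to be exported and transferred separately, which is the source of the drawbacks listed above.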

File-level synchronization with database synchronization

Status: Currently used for replication from the production node to the secondary node, for Failover purposes.

Similar to the above, but rather than export/import of database content, the database content is replicated directly.

Pros: Widely used and widely understood, relatively simple, mitigates the cons of simple file synchronization.

Cons: Replication is one-way. Replicating in the other direction after a failure of the primary node is a manual process.

Technologies:

DataONE API

Status: Implementation is in progress.

Dryad is becoming a member node of DataONE, which (among other things) will replicate content among diverse repositories and provide the capacity to serve that content from coordinating nodes through the DataONE API.

Pros: A diverse network of sites provides content backup, with retrieval failover.

Cons: For content that is non-public (e.g., embargoed or blacked out), permissions must be managed carefully. Content within the Dryad submission system cannot be backed up.

Alternate Methods

DSpace Import/Export

Background: DSpace allows individual records to be exported and imported. During export, a data file and its associated metadata are dumped to several files within a directory. The import process reads these files and creates a new DSpace item.

Proposed implementation: As items are approved for entry into the repository, they could be immediately exported, rsync could run, and the receiving DSpace could import.

Pros: Built into DSpace. Data files and their metadata are kept together.

Cons: The format is not recognized by anything other than DSpace, there are many steps in each synchronization process, and identifiers may get out of sync.
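
The proposed export, rsync, import sequence might be planned as in the sketch below. It assumes the standard DSpace command-line launcher and its item export and import options; the install path, host name, handle, collection and e-mail address are all hypothetical, and the function only builds the commands rather than running them.

```python
def build_sync_plan(item_id, export_dir, mirror_host, mirror_dir,
                    eperson, collection, mapfile,
                    dspace_bin="/opt/dspace/bin/dspace"):
    """Return (where, command) steps for one export -> rsync -> import cycle.

    dspace_bin is an assumed install location; adjust it to the local setup.
    """
    export_cmd = [
        dspace_bin, "export", "--type=ITEM", f"--id={item_id}",
        f"--dest={export_dir}", "--number=0",
    ]
    rsync_cmd = [
        "rsync", "--archive",
        export_dir.rstrip("/") + "/", f"{mirror_host}:{mirror_dir}",
    ]
    import_cmd = [
        dspace_bin, "import", "--add", f"--eperson={eperson}",
        f"--collection={collection}", f"--source={mirror_dir}",
        f"--mapfile={mapfile}",
    ]
    # The first two steps run on the primary node, the last on the mirror.
    return [("primary", export_cmd), ("primary", rsync_cmd), ("mirror", import_cmd)]
```

Even in this simplified form the plan has three steps on two machines per synchronization, which illustrates the "many steps" drawback noted above.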

iRODS

Background: iRODS is a sophisticated system for replicating data according to replication rules. iRODS has some support for DSpace.

DuraCloud

Background: DuraCloud aims to allow both DSpace and Fedora to store content in cloud-based servers (e.g., Amazon S3). DuraCloud allows content to be replicated to multiple cloud storage locations as well as local storage. It should be possible to configure multiple Dryad instances to use a single cloud-based storage layer, and maintain replicated copies on the local machines.

SafeArchive

SafeArchive is a solution for archival storage and replication management. Designed by the Data-PASS partners, it is a storage platform for policy-driven, distributed replication of digital holdings. The current version of SafeArchive is a self-contained system that can be installed, used and maintained by institutional staff without technical expertise. The set of open source tools can easily be used by libraries, museums and archives that wish to replicate their own content.

Cloud hosting: According to the SafeArchive web site, "SafeArchive is most easily run using Amazon Web Services (AWS). While our software is free open source software, AWS charges a fee for web hosting and data storage services." See Amazon Glacier for pricing using AWS.

Chronopolis

Based on SRB; NCSU submits items to it using BagIt.

MetaArchive

Transfer technologies

  • OAI-ORE harvesting (cannot handle access-restricted items, including embargoed items).
  • SWORD, available starting in DSpace 1.8. Dryad plans to support SWORD as part of a submission API and is evaluating SWORD for inter-repository package exchange.
  • BagIt package transfer. For Dryad's handshaking with TreeBASE, the contents of a data package are exported to a BagIt package. The package is sent via HTTP to TreeBASE, which unpacks the contents and transforms them into the internal TreeBASE format. Pros: BagIt/ORE is more standard than the DSpace export format. The format can be understood by many types of repositories. A BagIt package may contain an OAI-ORE description of the package contents. An entire data package is transferred as one piece. Cons: There are issues transferring very large data packages.
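
To make the BagIt option above more concrete, here is a minimal sketch of writing and validating a bag using only the Python standard library. It is not the Dryad or TreeBASE code: the function names are hypothetical, the BagIt version (0.97) and MD5 manifest are assumptions, and the optional OAI-ORE description and the HTTP transfer step are left out.

```python
import hashlib
from pathlib import Path


def make_bag(bag_dir, files):
    """Write a minimal bag: bagit.txt, a data/ payload and manifest-md5.txt.

    `files` maps payload file names to their content as bytes.
    """
    bag = Path(bag_dir)
    data = bag / "data"
    data.mkdir(parents=True, exist_ok=True)
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    lines = []
    for name, content in sorted(files.items()):
        (data / name).write_bytes(content)
        # Each manifest line is "<checksum>  <path relative to the bag root>".
        lines.append(f"{hashlib.md5(content).hexdigest()}  data/{name}")
    (bag / "manifest-md5.txt").write_text(
        "".join(line + "\n" for line in lines), encoding="utf-8"
    )
    return bag


def validate_bag(bag_dir):
    """Return payload paths whose checksum no longer matches the manifest."""
    bag = Path(bag_dir)
    bad = []
    manifest = (bag / "manifest-md5.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        if not line.strip():
            continue
        digest, rel = line.split(maxsplit=1)
        if hashlib.md5((bag / rel).read_bytes()).hexdigest() != digest:
            bad.append(rel)
    return bad
```

The receiving repository can run validate_bag after unpacking to confirm that the whole data package arrived intact, which is the practical benefit of transferring the package as one piece.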

See Also

Failover