Difference between revisions of "Replication System"

From Dryad wiki

Revision as of 13:55, 13 February 2013

Status: Simple (filesystem) replication is in place. We are working on other types of replication for greater fault-tolerance.

Data in Dryad will be replicated across multiple machines to support failover and improve access times.

When the initial Dryad grant proposal was submitted, LOCKSS was the best candidate for a replication system. Since then, more candidate systems have emerged, including projects such as DataONE (of which Dryad is a member) and experimental technologies such as DuraCloud.

Original Requirements

  1. When data is deposited in Dryad, within 1 hour it is mirrored on multiple machines. Initially, we will keep the replication simple, with all submissions being processed through the primary node -- this will be the master system.
  2. Both data and metadata must be mirrored.
  3. Data is periodically checked to make sure the copies are consistent, and a mechanism exists for correcting inconsistencies when they are found.
  4. The replication system must be easy to maintain over time.
  5. It must be easy to add new mirror nodes to the system.
  6. The solution must use a widely-used technology (to minimize the maintenance burden and maximize preservation). If any new pieces are required for this solution, it should be possible to donate them to the DSpace community, so they can be used more widely.
  7. When a user downloads a data file, the file should be transferred from the closest mirror node.
  8. Content is synchronized as soon as it enters the Dryad submission process. (It does not need to be approved by the curator before being synchronized.)
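Requirement 3 (periodic consistency checking) can be illustrated with a checksum comparison between two copies of a directory tree. This is only a minimal sketch: the function names are hypothetical, SHA-256 is an assumed choice of digest, and it compares files on locally mounted paths rather than across real Dryad nodes.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def tree_digests(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its digest."""
    return {
        p.relative_to(root).as_posix(): file_digest(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def compare_copies(primary: Path, mirror: Path) -> dict[str, list[str]]:
    """Report files that are missing, extra, or different on the mirror."""
    a, b = tree_digests(primary), tree_digests(mirror)
    return {
        "missing": sorted(set(a) - set(b)),
        "extra": sorted(set(b) - set(a)),
        "mismatched": sorted(k for k in a.keys() & b.keys() if a[k] != b[k]),
    }
```

A report with three empty lists means the copies agree. Anything listed under "missing" or "mismatched" would be a candidate for re-copying from the primary node.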

Our primary reasons for replication are:

  1. Guarding against data loss at one location (e.g., natural disaster or hardware failure)
  2. Speeding access for users around the world (multiple access points).

Methods Being Used or Developed

File-level synchronization

Status: Dryad currently uses file-level synchronization to back up content from the production system (NCSU) to the backup servers (NESCent). Data files are copied using the Unix rsync utility. For details, see WG:Server Setup.

Pro:

  1. Widely used and widely understood.
  2. Relatively simple.

Con:

  1. DSpace does not make this type of synchronization easy, since data files and their metadata are stored separately. It is possible for the data files and metadata to be slightly out of sync if the sync process occurs while content is being submitted.
  2. Requires synchronization of the entire database at once.
  3. Due to the above, it is difficult to migrate content "instantly" with this method.
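The rsync-based copy described in this section could be wrapped in a small script along the following lines. This is a sketch under stated assumptions, not Dryad's actual configuration: the paths and host name in the usage note are hypothetical, and the chosen rsync options (`--archive`, `--delete`, `--checksum`, `--dry-run`) are one reasonable combination rather than the documented production settings.

```python
import shlex
import subprocess


def build_rsync_command(source_dir, dest, dry_run=False):
    """Build an rsync argument list that mirrors source_dir to dest."""
    # --archive preserves permissions and timestamps, --delete removes files
    # that no longer exist on the source, --checksum compares file contents
    # instead of relying on size and modification time.
    cmd = ["rsync", "--archive", "--delete", "--checksum"]
    if dry_run:
        # Only report what would change; transfer nothing.
        cmd.append("--dry-run")
    # A trailing slash makes rsync copy the directory's contents
    # rather than the directory itself.
    cmd += [source_dir.rstrip("/") + "/", dest]
    return cmd


def run_sync(source_dir, dest, dry_run=False):
    """Run rsync and raise CalledProcessError if it exits with an error."""
    cmd = build_rsync_command(source_dir, dest, dry_run)
    print("running:", shlex.join(cmd))
    return subprocess.run(cmd, check=True)
```

For example, `run_sync("/srv/assetstore", "backup.example.org:/srv/assetstore/", dry_run=True)` would preview a sync to a hypothetical backup host. Note that this does nothing about the first con above: a sync that runs during a submission can still capture data files and metadata in slightly different states.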

File-level synchronization with database synchronization

Status: Currently used for replication from the production node to the secondary node, for failover purposes.

Similar to the above, but rather than export/import of database content, the database content is replicated directly.

Pro:

  1. Widely used and widely understood.
  2. Relatively simple.
  3. Mitigates the cons listed above.

Con:

  1. One-way replication. After a failure of the primary node, replicating in the other direction is a manual process.

Technology options:

Solr indexes have one-way replication built in: http://wiki.apache.org/solr/SolrReplication

Bucardo is running on the secondary node for asynchronous trigger-based replication. It is currently set up for one-way, master-slave replication, but it also supports master-master replication. Rubyrep can be used to verify that two databases are in sync.


http://wiki.postgresql.org/wiki/Replication%2C_Clustering%2C_and_Connection_Pooling
http://www.postgresql.org/docs/current/static/different-replication-solutions.html
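The idea of verifying that two databases are in sync can be illustrated with a per-table fingerprint comparison. This is a simplified stand-in, not how Rubyrep works internally: it hashes every row of each table in key order and compares the results. The function names and the `item` table are hypothetical, and the usage example relies on in-memory SQLite databases in place of the PostgreSQL nodes.

```python
import hashlib
import sqlite3


def table_fingerprint(conn, table, key_column):
    """Hash every row of a table, read in key order."""
    h = hashlib.sha256()
    cursor = conn.cursor()
    # Table and column names are interpolated into the query,
    # so they must come from trusted code, never from user input.
    cursor.execute(f'SELECT * FROM "{table}" ORDER BY "{key_column}"')
    for row in cursor:
        h.update(repr(row).encode("utf-8"))
    return h.hexdigest()


def out_of_sync_tables(primary, replica, tables):
    """Return names of tables whose contents differ between two connections.

    `tables` maps each table name to the column used for ordering.
    """
    return [
        name
        for name, key in tables.items()
        if table_fingerprint(primary, name, key) != table_fingerprint(replica, name, key)
    ]


if __name__ == "__main__":
    # Stand-in databases; a real check would connect to both nodes.
    primary = sqlite3.connect(":memory:")
    replica = sqlite3.connect(":memory:")
    for conn in (primary, replica):
        conn.execute("CREATE TABLE item (id INTEGER PRIMARY KEY, title TEXT)")
        conn.execute("INSERT INTO item VALUES (1, 'example')")
    print(out_of_sync_tables(primary, replica, {"item": "id"}))
```

With asynchronous replication, a check like this can report a difference simply because the replica has not yet caught up, so a mismatch is a prompt to look closer rather than proof of a fault.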

DataONE API

Status: Implementation is in progress.

Dryad is part of the DataONE project, which focuses on replicating content between many diverse repositories.

Pro:

  1. Dryad will be implementing the DataONE replication system regardless of any other replication decisions.
  2. Content will be dispersed across a wide geographic area.

Con:

  1. For content that is non-public (e.g., embargoed or blacked out), permissions must be managed carefully.
  2. DataONE cannot adequately replicate content that is within the Dryad submission system.

CLOCKSS

Status: Dryad is implementing support for replication through the CLOCKSS system.

LOCKSS is a system for replicating content among many sites. It was originally developed for libraries to manage their electronic journal collections. It could be adapted to work with the DSpace contents. CLOCKSS is a variant of LOCKSS with careful control over where content is stored and when the "dark" content in the archive is made public. Dryad plans to implement CLOCKSS as one of its primary preservation strategies.

For more details, see the CLOCKSS Technology page.

Alternate Methods

DSpace Import/Export

Background: DSpace allows individual records to be exported and imported. During export, a data file and its associated metadata are dumped to several files within a directory. The import process reads these files and creates a new DSpace item.

Proposed implementation: As items are approved for entry into the repository, they could be immediately exported, rsync could run, and the receiving DSpace could import.

Pro:

  1. Built into DSpace, and will be maintained long-term.
  2. Data files and their metadata are kept together.

Con:

  1. The format is not recognized by anything other than DSpace.
  2. Many steps in each synchronization process.
  3. Potential for identifiers to get out of sync?

BagIt package transfer

Background: For Dryad's handshaking with TreeBASE, the contents of a data package are exported to a BagIt package. This package may contain an OAI-ORE description of the package contents. The package is sent via HTTP to TreeBASE, which unpacks the contents and transforms them into the internal TreeBASE format.

Pro:

  1. BagIt/ORE is more standard than the DSpace export format. The format can be understood by many types of repositories.
  2. An entire data package is transferred as one piece.
  3. DataONE may adopt a similar standard.

Con:

  1. (same as for DSpace export)?
  2. Possible problems with transferring very large data packages?
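To make the BagIt packaging step more concrete, the sketch below writes a minimal bag (a `bagit.txt` declaration, a `data/` payload directory, and a checksum manifest) and then re-validates it. It is an illustration under assumptions, not the code Dryad uses for the TreeBASE handshake: the function names are hypothetical, the version string and the SHA-256 manifest are assumed choices, and the OAI-ORE description and the HTTP transfer are left out.

```python
import hashlib
import shutil
from pathlib import Path


def make_bag(source_files, bag_dir):
    """Write a minimal BagIt bag containing copies of source_files."""
    bag = Path(bag_dir)
    data = bag / "data"
    data.mkdir(parents=True)

    manifest_lines = []
    for src in map(Path, source_files):
        # Files are stored flat under data/, so names must be unique.
        dest = data / src.name
        shutil.copyfile(src, dest)
        digest = hashlib.sha256(dest.read_bytes()).hexdigest()
        manifest_lines.append(f"{digest}  data/{src.name}")

    # Bag declaration file required by the BagIt specification.
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )
    (bag / "manifest-sha256.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )
    return bag


def validate_bag(bag_dir):
    """Return payload paths whose checksum no longer matches the manifest."""
    bag = Path(bag_dir)
    bad = []
    manifest = (bag / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        digest, rel_path = line.split(maxsplit=1)
        if hashlib.sha256((bag / rel_path).read_bytes()).hexdigest() != digest:
            bad.append(rel_path)
    return bad
```

A receiving repository could run a check like `validate_bag` after unpacking to confirm that the payload arrived intact. Because this sketch reads each file fully into memory, very large data packages would need chunked hashing instead.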

iRODS

Background: iRODS is a sophisticated system for replicating data according to replication rules.

Pro:

  1. The iRODS team is located at UNC.
  2. iRODS has some support for DSpace.

Con:

  1. The Dryad team does not have any iRODS expertise.
  2. iRODS is very complex and heavyweight.

DuraCloud

Background: DuraCloud is the first collaboration between the DSpace community and the Fedora Commons community. Its purpose is to allow both DSpace and Fedora to store content in cloud-based servers (e.g., Amazon S3). DuraCloud allows content to be replicated to multiple cloud storage locations as well as local storage. It should be possible to configure multiple Dryad instances to use a single cloud-based storage layer, and maintain replicated copies on the local machines.

Pro:

  1. DuraCloud is already being supported by the DSpace community.
  2. The team behind DuraCloud has an excellent track record, and should not be underestimated.

Con:

  1. DuraCloud is a new and unproven technology.
  2. Cloud-based storage solutions may be too expensive for our needs.

DuraCloud recently reduced its prices. Full pricing chart: http://www.duracloud.org/content/pricing

Chronopolis

Background: Chronopolis is based on SRB; NCSU submits items to it using BagIt.

Pro:

  1. NCSU has some experience with it.

Con:

  1.  ???

MetaArchive

Background: ????

OAI-ORE harvesting

Con: Can't handle access-restricted items (including embargoed items)

SWORD transfer

  • will be available in DSpace 1.8