Harvesting Technology

From Dryad wiki

Dryad harvests remote partner repositories so that it can offer a search across all of their contents. In some cases Dryad has partnered with the remote repository through a handshaking process (e.g., TreeBASE); in others it has not (e.g., KNB). Either way, every Dryad search is also run against the contents of the remote repositories, and those results are displayed by the TabbedSearching system on the search results page.

Functionality

Dryad places the content it harvests from partner repositories in separate collections, one for each partner repository. These harvested collections are treated differently from the collections that are part of Dryad itself (DryadLab, Data Files, Data Packages, etc.): their contents are searched and displayed in a separate tab on the search results page.

OAI-PMH is the protocol used to harvest these partner repositories. The specification is available online, as are simple tutorials. For quick samples of how to use OAI-PMH, see the Data Access instructions.
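As a rough illustration of what an OAI-PMH request looks like, the Python sketch below builds request URLs for the ListRecords verb. The helper name build_oai_url and the token value "abc123" are invented for this example; the base URL is the TreeBASE endpoint listed under Configuration. It only constructs URLs and does not contact any server.

```python
from urllib.parse import urlencode


def build_oai_url(base_url, verb, **params):
    """Return an OAI-PMH request URL for the given verb and arguments."""
    query = {"verb": verb}
    # Skip arguments that were not supplied.
    query.update({k: v for k, v in params.items() if v is not None})
    # In OAI-PMH, resumptionToken is an exclusive argument: when it is
    # present, the request carries only the verb and the token.
    if "resumptionToken" in query:
        query = {"verb": verb, "resumptionToken": query["resumptionToken"]}
    return base_url + "?" + urlencode(query)


BASE = "http://www.treebase.org/treebase-web/top/oai"

# First page of a harvest in Simple Dublin Core (metadataPrefix "oai_dc").
first_page = build_oai_url(BASE, "ListRecords", metadataPrefix="oai_dc")

# Follow-up page; "abc123" stands in for a token returned by the server.
next_page = build_oai_url(
    BASE, "ListRecords", metadataPrefix="oai_dc", resumptionToken="abc123"
)

print(first_page)
print(next_page)
```

A harvester repeats the follow-up request with each new token until the provider returns an empty resumptionToken.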

Workflow

The workflow for harvesting uses the standard DSpace mechanism to set up a collection with a harvested source. As explained above, Dryad uses OAI-PMH as its harvesting mechanism. The harvesting panel within DSpace (login required) lets an administrator control which collections have a harvested source and whether they are set for active harvest. Dryad marks every collection that has a harvested source as an active harvest, unless there is a problem with the remote OAI-PMH provider.

The one difference in workflow between the DSpace OAI harvester and Dryad's is that Dryad has made some modifications to account for issues we've seen with records from TreeBASE. These changes are in the OAIHarvester class that Dryad overlays; searching the code for "TreeBASE" will locate them. They amount to one additional step in the workflow for digesting TreeBASE records. At some point in the future it may be possible to remove them, and Dryad would then use the standard DSpace OAI-PMH implementation again.

Configuration

The configuration for each harvest is set in the DSpace harvest control panel.

TreeBASE:

* http://www.treebase.org/treebase-web/top/oai
* Harvest all sets
* Harvest "Simple Dublin Core"
* Harvest metadata only

LTER/KNB:

* http://metacat.lternet.edu:8080/knb/dataProvider
* Harvest all sets
* Harvest "Simple Dublin Core"
* Harvest metadata only
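In OAI-PMH terms, "Simple Dublin Core" corresponds to the oai_dc metadata prefix, and harvesting all sets means sending no set argument. The Python sketch below shows one way a ListRecords response in that format could be read. The sample XML, its identifier and field values, and the function name parse_list_records are all invented for illustration; this is not the code DSpace or Dryad runs.

```python
import xml.etree.ElementTree as ET

# Namespaces defined by OAI-PMH 2.0 and Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "oai_dc": "http://www.openarchives.org/OAI/2.0/oai_dc/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical response body, trimmed to the parts used below.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:1</identifier>
        <datestamp>2012-01-01</datestamp>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example study</dc:title>
          <dc:creator>Doe, J.</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
    <resumptionToken>token-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""


def parse_list_records(xml_text):
    """Return (records, resumption_token) from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.findall("oai:ListRecords/oai:record", NS):
        identifier = rec.findtext("oai:header/oai:identifier", namespaces=NS)
        dc = rec.find("oai:metadata/oai_dc:dc", NS)
        fields = {}
        # Deleted records carry a header but no metadata element.
        if dc is not None:
            for el in dc:
                # Strip the "{namespace}" prefix from the tag name.
                name = el.tag.rsplit("}", 1)[-1]
                fields.setdefault(name, []).append(el.text or "")
        records.append({"identifier": identifier, "fields": fields})
    token = root.findtext("oai:ListRecords/oai:resumptionToken", namespaces=NS)
    # A missing or empty token means there are no further pages.
    return records, (token or None)


records, token = parse_list_records(SAMPLE)
print(records[0]["identifier"], records[0]["fields"]["title"], token)
```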

The harvest control panel includes an option to test a harvest. The test determines whether the service is properly configured, but it may not reveal particular issues with the harvest that would prevent it from completing successfully.
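To give a sense of what a configuration check can and cannot see, here is a minimal Python sketch that inspects a ListMetadataFormats response for protocol errors and for the oai_dc prefix. It is an assumption about what such a check might involve, not a description of the DSpace test; the function name check_formats_response and the sample response are invented. A check of this kind says nothing about individual records, which is why a harvest can still fail after the test passes.

```python
import xml.etree.ElementTree as ET

OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Hypothetical ListMetadataFormats response from a provider.
FORMATS_SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListMetadataFormats>
    <metadataFormat>
      <metadataPrefix>oai_dc</metadataPrefix>
      <schema>http://www.openarchives.org/OAI/2.0/oai_dc.xsd</schema>
      <metadataNamespace>http://www.openarchives.org/OAI/2.0/oai_dc/</metadataNamespace>
    </metadataFormat>
  </ListMetadataFormats>
</OAI-PMH>"""


def check_formats_response(xml_text, wanted_prefix="oai_dc"):
    """Report protocol errors and whether the wanted prefix is offered."""
    root = ET.fromstring(xml_text)
    # OAI-PMH reports problems as <error code="..."> elements.
    errors = [
        (err.get("code"), (err.text or "").strip())
        for err in root.findall("oai:error", OAI_NS)
    ]
    prefixes = [
        el.text
        for el in root.findall(
            "oai:ListMetadataFormats/oai:metadataFormat/oai:metadataPrefix",
            OAI_NS,
        )
    ]
    return {"errors": errors, "supported": wanted_prefix in prefixes}


result = check_formats_response(FORMATS_SAMPLE)
print(result)
```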

There is also a configuration option in the dspace.cfg file that controls whether harvesting starts again when Dryad restarts. It defaults to false, which leaves harvesting suspended after a restart, but Dryad currently sets it to true so that harvests restart automatically.

# Determines whether the harvester scheduling process should
# be started automatically when the DSpace webapp is deployed.
# default: false
harvester.autoStart=true
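For anyone scripting a check of this setting, the following Python sketch reads a key from properties-style text. The function name read_property is invented, and the parser is deliberately simplified: it handles only "key=value" lines and "#" comments, not the full Java properties syntax.

```python
def read_property(cfg_text, key, default=None):
    """Return the value of key in properties-style text, or default."""
    value = default
    for line in cfg_text.splitlines():
        line = line.strip()
        # Ignore blank lines and comments.
        if not line or line.startswith("#"):
            continue
        name, sep, raw = line.partition("=")
        if sep and name.strip() == key:
            # Keep scanning so that a later definition wins.
            value = raw.strip()
    return value


CFG = """\
# Determines whether the harvester scheduling process should
# be started automatically when the DSpace webapp is deployed.
# default: false
harvester.autoStart=true
"""

# Fall back to the documented default when the key is absent.
auto_start = read_property(CFG, "harvester.autoStart", "false") == "true"
print(auto_start)
```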

Relation to DSpace

There has been very little customization of the OAI-PMH harvester. It is directly related to the DSpace implementation.