OAI Harvester

Status: Harvester development is complete.

Goal: Content from related repositories is discoverable through Dryad.

Requirements:

  • Harvest OAI-PMH records from a set list of providers.
  • The harvester can be configured to update frequently (e.g., hourly) or infrequently (e.g., weekly, monthly), depending on the provider.
  • Content from each provider maps to a single community in Dryad.
  • Depending on the provider, content may map to a single collection or multiple collections within the community.
  • A hook can transform the records into a form suitable for ingest before they are placed in Dryad.
    • This hook can execute arbitrary code, though most frequently it would apply an XSLT to the items.
    • When complex items are ingested (e.g., publication/dataset packages), the DSpace items are automatically linked.
  • The ingest processor should correctly handle updates and deletes of the OAI-PMH content.
  • Dryad searches can include or exclude items from the harvested collections. This can be accomplished through the normal DSpace search scopes, by searching either all of DSpace or just the Dryad "Main" community.
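
The requirements above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the harvester's actual code: the Provider fields, is_due, and parse_list_records are hypothetical names, and the sample response is invented. What does come from the OAI-PMH specification is the namespace, the status="deleted" attribute on the header of a deleted record, and the resumptionToken element used for paging.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Callable

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def identity(metadata):
    """Default transform hook: pass the record through unchanged."""
    return metadata


@dataclass
class Provider:
    # Hypothetical configuration fields, not real Dryad or DSpace settings.
    name: str
    base_url: str
    community: str        # the single Dryad community this provider maps to
    interval_hours: int   # 1 for hourly, 168 for weekly, and so on
    transform: Callable = identity  # hook; could wrap an XSLT in practice


def is_due(provider, hours_since_last_harvest):
    """Decide whether this provider should be harvested again."""
    return hours_since_last_harvest >= provider.interval_hours


def parse_list_records(xml_text, provider):
    """Turn one ListRecords response page into (actions, resumption_token)."""
    root = ET.fromstring(xml_text)
    actions = []
    for record in root.iter(OAI_NS + "record"):
        header = record.find(OAI_NS + "header")
        identifier = header.findtext(OAI_NS + "identifier")
        if header.get("status") == "deleted":
            # Deleted records carry a header but no metadata.
            actions.append(("delete", identifier, None))
        else:
            metadata = record.find(OAI_NS + "metadata")
            actions.append(("upsert", identifier, provider.transform(metadata)))
    token = root.findtext(f".//{OAI_NS}resumptionToken")
    return actions, (token or None)  # an empty token marks the last page


# Invented sample response, for illustration only.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:example.org:1</identifier>
        <datestamp>2010-01-01</datestamp>
      </header>
      <metadata><title>Study A</title></metadata>
    </record>
    <record>
      <header status="deleted">
        <identifier>oai:example.org:2</identifier>
        <datestamp>2010-01-02</datestamp>
      </header>
    </record>
    <resumptionToken>page2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

if __name__ == "__main__":
    provider = Provider("example", "http://example.org/oai", "Example Community", 24)
    actions, token = parse_list_records(SAMPLE, provider)
    for kind, identifier, _ in actions:
        print(kind, identifier)
    print("next page token:", token)
```

An "upsert" action covers both new and updated records, since the ingest processor would look up the OAI identifier to decide which applies. Mapping a record to one of several collections within the community could be done inside the transform hook or as a separate step; the sketch leaves that open.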

Design thoughts:

  • It may be useful to display "other repository" content separately, similar to the way NSDL separates content from other repositories, or the way Google separates advertisements from the normal search results.

Open questions:

  • How do we handle duplicate publications? (e.g., harvest of a study from TreeBASE that is equivalent to an existing publication in Dryad)
    • Can this duplication be detected? (@mire sells a tool for this purpose)
    • Should the duplicates be merged? Or should they remain separate?
  • Does it make sense to integrate this system with SWORD? (see the proposed SWORD-ORE project for the Cyberinfrastructure Summer Traineeships, which has not been implemented yet)
  • We need to keep in mind the needs of individual scientists. How many will want to search "by publication" (journal model) vs. "by dataset" (GenBank model)?
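
On the duplicate-detection question above, one simple starting point would be to match harvested records against existing ones by DOI, falling back to a normalized title. The sketch below is a naive illustration only: it is not the @mire tool, the record dictionaries and their keys ("id", "title", "doi") are assumptions, and exact title matching will miss many real duplicates.

```python
import re
import unicodedata


def normalize_title(title):
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9]+", " ", text.lower())
    return text.strip()


def find_duplicates(existing, harvested):
    """Return (harvested_id, existing_id) pairs sharing a DOI or a normalized title."""
    by_doi = {r["doi"].lower(): r["id"] for r in existing if r.get("doi")}
    by_title = {normalize_title(r["title"]): r["id"] for r in existing}
    matches = []
    for record in harvested:
        doi = (record.get("doi") or "").lower()
        match = by_doi.get(doi) if doi else None
        if match is None:
            # No DOI match, so fall back to the normalized title.
            match = by_title.get(normalize_title(record["title"]))
        if match is not None:
            matches.append((record["id"], match))
    return matches
```

Whether matched pairs should then be merged or kept separate remains the open question; the function only reports candidates.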