BagIt Handshaking

Overview

This page describes the Dryad BagIt Handshaking module. This module enables Dryad to share data packages with other repositories using BagIt as a transfer protocol. Currently, it is used to share NEXUS data between Dryad and TreeBASE.

Workflow

Handshaking from the Dryad Perspective

There are three possible ways that Dryad may integrate with TreeBASE.

User Submits to Dryad First

  1. User submits a NEXUS file to Dryad and, in the final stage of submission, chooses to send the file to TreeBASE.
  2. Dryad pushes the object to TreeBASE. (This happens before the object is curated in Dryad.)
    1. Metadata (converted to the Dryad Application Profile) and all uploaded NEXUS files are packed into a BagIt package and pushed to TreeBASE
    2. TreeBASE has a PUT RESTful service for receiving data
    3. TreeBASE only accepts the PUT if the sender's IP is within the Dryad range
    4. TreeBASE responds by returning a URL
  3. Dryad emails the user to inform them that the file(s) have been uploaded to TreeBASE and includes a URL to start the submission at TreeBASE
  4. User goes to TreeBASE and completes record.
    1. Upon the user logging in, TreeBASE is triggered to unpack the BagIt package and create a submission based on the contents
  5. Dryad harvests TreeBASE content and picks up any additional relevant metadata or updated files.
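
The packing step (2.1) can be sketched as follows. This is a minimal illustration of the BagIt layout (a bag declaration, a data/ payload directory, and a checksum manifest), not the Dryad module's actual code. The metadata file name dryadpkg.xml, the SHA-256 checksum algorithm, and the BagIt version written to bagit.txt are all assumptions; the page does not state what the module uses.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Mapping


def make_bag(bag_dir: Path, metadata_xml: str, nexus_files: Mapping[str, bytes]) -> Path:
    """Write a minimal BagIt bag: bagit.txt, a data/ payload, and a manifest."""
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True, exist_ok=True)

    # Payload: the crosswalked metadata plus every NEXUS file chosen for export.
    # "dryadpkg.xml" is a hypothetical file name used only for this sketch.
    payload = {"dryadpkg.xml": metadata_xml.encode("utf-8"), **nexus_files}

    manifest_lines = []
    for name, content in sorted(payload.items()):
        (data_dir / name).write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        # Manifest format: checksum, whitespace, path relative to the bag root.
        manifest_lines.append(f"{digest}  data/{name}")

    # Bag declaration; the version number here is an assumption.
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n", encoding="utf-8"
    )
    (bag_dir / "manifest-sha256.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )
    return bag_dir


if __name__ == "__main__":
    bag = make_bag(Path(tempfile.mkdtemp()) / "bag", "<metadata/>", {"tree.nex": b"#NEXUS\n"})
    print(sorted(p.relative_to(bag).as_posix() for p in bag.rglob("*") if p.is_file()))
```

The receiving side can recompute each checksum listed in the manifest to confirm that the payload arrived intact before it unpacks the bag.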

User Submits to TreeBASE First, Links from Dryad Record

  1. User submits/edits package in Dryad and includes a TB ID
  2. Dryad adds TB ID to the record and creates a link to the HTML view of the TB record
  3. Dryad curator checks link to confirm that TB ID looks valid (we decided to leave ID in record if the link isn't active yet but looks like a valid TB ID).
  4. General link-checking software will be run to catch links that have been dead for a while.

User submits to TreeBASE, Dryad harvests only

  1. Treat as any other harvested content (second class)

Handshaking from the TreeBASE Perspective

  1. Dryad offers an option to upload files to TreeBASE for Dryad submissions that have phylogenetic data
  2. Choosing this causes Dryad to send a BagIt package to TreeBASE via a REST service: http://www.treebase.org/treebase-web/handshaking/dryadImport (there is a different URL for testing -- http://treebasedb-dev.nescent.org/treebase-web)
  3. The REST service stores the BagIt package and returns a unique URL as a string, resembling: "~/treebase-web/login.jsp?importKey=123456789"
  4. Dryad emails this URL to the user and asks the user to follow it so as to finish the submission on the TreeBASE website
  5. Whoever follows the URL takes ownership of the data. The user is confronted with a login page, including the option of creating a new account. If a new account is required, so long as the user remains within the same session (i.e. does not start a new browser), the BagIt is unpacked and the contents are parked under a new submission that is created in the user's account.
  6. The user can open the new submission and continue filling in information. The new submission should have citation information already filled in, and any NEXUS files in the BagIt will have been parsed and stored.
  7. TreeBASE does not make content available until the associated article is published.
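
Steps 2 and 3 above can be sketched from the client side as follows. The endpoint URL and the importKey parameter come from this page. The Content-Type header and the idea that the bag is sent as a single serialized archive are assumptions. The sketch only prepares the request and does not send it, since TreeBASE accepts the PUT only from the Dryad IP range.

```python
import urllib.request
from urllib.parse import parse_qs, urlparse

TREEBASE_IMPORT_URL = "http://www.treebase.org/treebase-web/handshaking/dryadImport"


def build_put_request(bag_bytes: bytes, url: str = TREEBASE_IMPORT_URL) -> urllib.request.Request:
    """Prepare (but do not send) an HTTP PUT carrying a serialized BagIt package."""
    return urllib.request.Request(
        url,
        data=bag_bytes,
        method="PUT",
        # Assumption: the page does not say which content type TreeBASE expects.
        headers={"Content-Type": "application/zip"},
    )


def import_key(returned_url: str) -> str:
    """Extract the importKey query parameter from the URL that TreeBASE returns."""
    values = parse_qs(urlparse(returned_url).query).get("importKey", [])
    if len(values) != 1:
        raise ValueError(f"expected exactly one importKey in {returned_url!r}")
    return values[0]


if __name__ == "__main__":
    request = build_put_request(b"serialized bag bytes")
    print(request.get_method(), request.full_url)
    print(import_key("~/treebase-web/login.jsp?importKey=123456789"))
```

Passing the prepared request to urllib.request.urlopen would perform the upload; the response body would then be the login URL that Dryad emails to the user.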

Process for completing a submission within TreeBASE

Minimum Requirements

  • nexus file
    • at least one tree OR at least one matrix
    • if there is a tree and a matrix, the taxon labels must match up.
    • must be "understood" by Mesquite
  • citation
  • analysis info linking matrices and trees
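
The requirement that taxon labels match between a tree and a matrix can be illustrated with a simple set comparison. This sketch assumes the labels have already been extracted from the NEXUS file and compares them as exact strings. How TreeBASE and Mesquite actually normalize labels (case, underscores versus spaces) is not described on this page, so the function below is illustrative only.

```python
from typing import Iterable, List, Tuple


def label_mismatches(
    matrix_labels: Iterable[str], tree_labels: Iterable[str]
) -> Tuple[List[str], List[str]]:
    """Return (labels only in the matrix, labels only in the tree)."""
    matrix_set, tree_set = set(matrix_labels), set(tree_labels)
    return sorted(matrix_set - tree_set), sorted(tree_set - matrix_set)


if __name__ == "__main__":
    only_matrix, only_tree = label_mismatches(
        ["Homo_sapiens", "Pan_troglodytes"],
        ["Homo_sapiens", "Gorilla_gorilla"],
    )
    print("only in matrix:", only_matrix)
    print("only in tree:", only_tree)
```

Two empty lists mean the label sets agree under this exact-match assumption.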

Detailed Process

  1. create account
  2. login
  3. create new submission
  4. type title
    1. the submission gets a PURL at this point
    2. the PURL can have a code added for reviewer access
  5. fill in citation
    1. minimum: year, title, journal name (or book/section title)
    2. journal names auto-suggest as you type
  6. add authors
    1. minimum: at least one author (with first name and last name)
    2. must always search for an existing author first, even if you know they're not in the system
    3. allows reordering or deleting authors while you're in the process
  7. upload file(s)
    1. minimum: must be nexus, as described above
  8. (optional) add notes
    1. this is a textarea, with a reasonable character limit (not enough for a readme file)
  9. (optional) edit details for matrices
  10. (optional) edit row segment template
    1. minimum: row ID, start index, end index
  11. (optional) provide more details for trees
  12. (optional) taxa
    1. match all named taxa against uBio or NCBI
    2. although the cleanup is optional, the TB editor may reject it if it's not cleaned up
  13. analysis
    1. minimum: create an analysis with at least one step. Typically, this will be a matrix that is processed to create one or more trees.
    2. minimum: otu labels must match in the analysis steps
  14. when the initial submission is complete, the user clicks "change to ready state"
    1. this triggers the curator to look at it
    2. user can leave items as "in progress" as long as they want -- this is a "poor man's embargo" system

Debugging

If changes in the metadata break the metadata crosswalk, the first place to look is the DRYAD-V3 metadata crosswalk (in the crosswalks folder in the config directory). Since this piece can be cumbersome to test, try adding a root element to the XML that the crosswalk generates. Do this by wrapping the contents of the root XSLT template in a root element. This element should already be in the XSLT and should only need to be uncommented.

The next step in debugging a metadata crosswalk failure is to use the XMLVERBATIM metadata crosswalk, which is also in the crosswalks folder inside the config directory. This XSLT simply emits the XML that comes out of the DSpace Item to XML transformation. Comparing this output with what the Dryad stylesheet expects should show where the two diverge.
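
One reason the root-element trick helps is that crosswalk output with several top-level elements is not well-formed XML, so ordinary XML tools refuse to load it. The sketch below shows the same idea outside XSLT: wrap a fragment in a temporary root element so that it can be parsed and inspected. The element names are made up for the example, and the sketch assumes the fragment has no XML declaration and no undeclared namespace prefixes.

```python
import xml.etree.ElementTree as ET


def parse_fragment(fragment: str, root_tag: str = "root") -> ET.Element:
    """Wrap an XML fragment in a temporary root element and parse it."""
    return ET.fromstring(f"<{root_tag}>{fragment}</{root_tag}>")


if __name__ == "__main__":
    # Hypothetical crosswalk output with two top-level elements.
    fragment = "<title>Example package</title><creator>Example Author</creator>"
    try:
        ET.fromstring(fragment)
    except ET.ParseError as error:
        print("unwrapped fragment is rejected:", error)
    for child in parse_fragment(fragment):
        print(child.tag, "=", child.text)
```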

Running Via Script

There is a script that exports Dryad data packages and requested data files into the BagIt format and sends the BagIt package to the TreeBASE web service. The script is implemented as a DSpace package disseminator and can be run through the standard DSpace disseminator script. For example:

/opt/dryad/bin/dspace packager -d -i 10255/dryad.630 -e submitter@email.org -o xwalk=DRYAD-V3 -o repo=TREEBASE -o 'files=10255/dryad.631;10255/dryad.632;10255/dryad.633;10255/dryad.634' -t BAGIT -

/opt/dryad/bin/dspace = the standard DSpace interface for calling scripts from the command line; this must be run as the same user that runs the Tomcat instance serving DSpace.

packager -d = the specific DSpace packager script; the -d tells it to run the packager to disseminate (there isn't a bagit ingester yet, but -i would be used if there were).

-i 10255/dryad.630 = the Dryad data package's pseudo-handle (that DSpace uses as an internal identifier); this must be the pseudo-handle of a data package, not a data file.

-e submitter@email.org = the email address of the person who is making the submission to Dryad; we have this because someone must be logged in in order to submit to Dryad.

-o xwalk=DRYAD-V3 = this is a hard-coded value at the moment (will always be the same); it says to export the metadata stored in DSpace in the Dryad Application Metadata Profile format (version 3 is the latest version).

-o repo=TREEBASE = this is also hard-coded for TreeBASE; repo just indicates the remote repository to which the BagIt file should be uploaded (in the future, Dryad may send BagIt files to other repositories as well).

-o files=10255/dryad.631;10255/dryad.632;10255/dryad.633;10255/dryad.634 = the list of data files, associated with the data package, that the submitter wants exported to TreeBASE; it may not be all the data files, but all the files of a particular type (for instance, in the case of TreeBASE, Nexus files). For calling the script from the command line, this argument will probably have to be enclosed in single quotes (see the example above).

-t BAGIT = the package disseminator that should be used to export the files; in the case of TreeBASE, this is always hard-coded to BAGIT because that is the format that the TreeBASE importer supports.

- = This indicates that output should be sent to the command line; currently, the script follows the UNIX philosophy that 'no response' is an indicator of success (optionally, the code can be uncommented to allow the output of 'success' on a successful export). If there is a problem, an exception will be thrown and an email with details will be sent to the address configured as the DSpace admin.
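
For callers that launch the packager from another program, the arguments described above can be assembled as a list, which sidesteps shell quoting entirely. The sketch below is an illustration only. The function name packager_command is hypothetical; the flags and values are those from the example on this page. Rendering the list with shlex.join shows why the files argument needs single quotes when typed at a shell: the semicolons would otherwise be read as command separators.

```python
import shlex
from typing import List, Sequence


def packager_command(
    package_handle: str,
    submitter_email: str,
    file_handles: Sequence[str],
    dspace_bin: str = "/opt/dryad/bin/dspace",
) -> List[str]:
    """Build the argument list for the BagIt disseminator call described above."""
    return [
        dspace_bin, "packager", "-d",
        "-i", package_handle,             # pseudo-handle of the data package
        "-e", submitter_email,            # submitter's email address
        "-o", "xwalk=DRYAD-V3",           # hard-coded metadata crosswalk
        "-o", "repo=TREEBASE",            # hard-coded target repository
        "-o", "files=" + ";".join(file_handles),
        "-t", "BAGIT",                    # package disseminator to use
        "-",                              # send output to the command line
    ]


if __name__ == "__main__":
    argv = packager_command(
        "10255/dryad.630",
        "submitter@email.org",
        ["10255/dryad.631", "10255/dryad.632"],
    )
    # shlex.join renders the list as a shell-safe command string.
    print(shlex.join(argv))
```

The list can be handed directly to subprocess.run(argv, check=True), run as the same user as Tomcat, without any quoting.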

Implementation

The BagIt disseminator is implemented as a DSpace plugin.

  • It is configured in dspace.cfg
  • Java code is in modules/bagit/dspace-bagit-api/src/main/java/org/dspace/content/packager
  • XSL for transforming the metadata is in config/crosswalks/dryad-v3.xsl