Open Tree / Dryad handshake

From Dryad wiki

This page describes how a data package is deposited in Dryad from the Open Tree of Life UI, and records the current status of the project.

Procedure

The user starts from the Open Tree curation webapp and navigates to the study to be deposited, where there is a 'deposit to Dryad' hyperlink.

  1. Open Tree displays a manifest showing what is going to be sent to Dryad.
    • NexML file
    • README file containing study and tree metadata (publication DOI, ingroup/focal clade, method, etc)
    • The raw open tree submission files (e.g. Newick strings)
    • Supplementary files if any
  2. Alongside the manifest there is a submit-like button to confirm transmission to Dryad.
  3. User confirms. Following confirmation: 'Please wait while we prepare a package and send it...'
  4. Open Tree assembles a 'package' (a .zip file) for the study. The package is in BagIt format, similar to what is currently generated when Dryad does a Treebase deposit.
  5. Open Tree does a POST to Dryad using SWORD protocol. A client account/password is provided for authentication to Dryad.
  6. Dryad ingests the package and responds with a 'deposit receipt', a bit of XML, that contains a 'claim check'. (Currently the claim check is the handle assigned to the new data package, but it could be something else.)
  7. Open Tree says 'submission accepted' and offers a link to navigate to Dryad to complete the submission. The link includes the claim check as a way of identifying the submission. Open Tree also emails this link to the user, to help prevent its getting lost.
  8. The user follows the link for the claim check. Dryad asks user to log in, if not logged in already.
  9. By navigating to the package using the claim check, the user 'claims' the package, which becomes associated with their Dryad user account.
  10. Dryad submission sequence begins:
    1. User selects 'published'/'accepted'/'in prep' and confirms Dryad ToS and CC0. The DOI is prefilled from Open Tree.
    2. 'Describe publication'
      • (Publication reference info is prefilled with info from Crossref based on DOI.)
      • (Focal clade is prefilled.)
      • Submitter provides abstract, keywords, taxonomic names, geographic info, geologic eras
    3. 'Upload data files'
      • (Open Tree has already provided the files.)
      • Submitter changes / extends file title and description if desired.
      • Data file author list can be changed.
      • Addition of readme file(s) if desired.
      • (Usually no embargo)
    4. 'Review'
      • User can add more files if desired (back to step 3)
    5. 'Checkout & submit'
  11. When the submission is complete, Dryad asynchronously sends the data package DOI back to Open Tree using an Open Tree web service. Open Tree records the DOI as the "data deposit URI" associated with the study in Open Tree.
  12. (We may want a garbage collector for deposits left unclaimed for a long time.)

Notes

As I read it, the Dryad Metadata Application Profile (DMAP) schema makes all of its defined fields mandatory. A lenient variant of DMAP will be needed for Open Tree, since DMAP requires several metadata fields for information that Open Tree doesn't have.

There are two versions of DMAP, 3.0 and 3.1. The Treebase handshake uses version 3.0, but we'll use version 3.1, which is simpler. Unfortunately that means packages generated for Treebase can't be parsed by the BagIt ingester used for Open Tree deposits.

For the 'push' we will use SWORD protocol version 1.3, i.e. HTTP POST with an X-Packaging: request header. The current version of SWORD is 2.0 but (a) we don't need the features of 2.0 and (b) 2.0 is not implemented in Dryad.
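
As a rough illustration of what such a SWORD 1.3 deposit looks like from the client side, here is a minimal Python sketch. It only builds the request and does not send it. The function name build_sword_deposit is made up for this example; the packaging URI and deposit URL are the ones used in the wget examples on this page, and the actual Open Tree script may be organized quite differently.

```python
import base64
import urllib.request

# Packaging URI taken from the BagIt wget example on this page.
BAGIT_PACKAGING = "http://purl.org/net/sword-types/bagit"


def build_sword_deposit(url, zip_bytes, user, password, packaging=BAGIT_PACKAGING):
    """Build (but do not send) a SWORD 1.3 style deposit request.

    Hypothetical helper: a deposit is an HTTP POST of the .zip body with a
    Content-Type header, an X-Packaging header, and HTTP Basic credentials.
    """
    credentials = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    request = urllib.request.Request(url, data=zip_bytes, method="POST")
    request.add_header("Content-Type", "application/zip")
    request.add_header("X-Packaging", packaging)
    request.add_header("Authorization", "Basic " + credentials)
    return request


def send_deposit(request):
    """Send the request and return the response body (the deposit receipt XML)."""
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

A call such as build_sword_deposit("http://localhost:9999/sword/deposit/10255/3", data, "USER", "PASSWORD") followed by send_deposit(...) would then correspond to the wget invocations shown under Documentation.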

Status as of 2015-03-17

Accomplished:

  • Completed survey of packaging and metadata format options: Deposit from Applications
  • Design, see above
  • Wrote a working python script that generates a BagIt package for an Open Tree study, given the study id, and submits it to Dryad using SWORD
  • Enabled the SWORD code in the Dryad code base (changes to Dryad in github fork)
  • Wrote a parser for Dryad-BagIt packages that creates Dryad data package and data file items
  • Submissions show up in the Dryad UI under 'unfinished submissions'
  • Enabled modification of Dryad-BagIt submissions using the Dryad submission UI
    • Study how Dryad creates data packages so that their submission is 'resumable'
      • Figure out where data package creation happens
  • Metadata 'crosswalk' to bring metadata from Open Tree into Dryad (DOI, mainly)
    • Study other import crosswalks
    • Study the BagIt disseminator crosswalk (the inverse of what we need)
    • Learn enough XSLT to be able to write a crosswalk
    • Figure out where and how to invoke the crosswalk
    • Deal with incompatibilities with submission front end
  • Invoke Crossref API to fill in bibliographic metadata (title, authors, etc.) given the article DOI
  • Complete Open Tree's package creation script (aux files, tree metadata)
  • Choose between push and pull - answer is push

Current:

  • Open Tree UI (with help from Jim Allman)
    • Enable invocation of package generation script from curation UI
    • What should the dummy account be for the SWORD deposit user/password? (Or make deposit open?)
  • Implement claim checks
    • What should the claim check be: based on workspace id, based on handle, or something else?
      • (SWORD deposit receipt contains handle, but submission UI wants workspace id)
    • If needed, service for any new URL required for activating the claim check
    • Exactly where and how to change ownership of the data package

Future:

  • Dryad to tell Open Tree about DOI (how does it know to do that, and how to do it? maybe this is not the right way to go)
  • Figure out weird bug with wrong submission step indicator in Dryad submission sequence
  • Rewrite everything for robustness, documentation, and tests

The timeline is impossible to predict, but we are targeting 20 May 2015 for launch (the SSB workshop in Ann Arbor).

Open Questions

  • JR and RS to determine how to handle publications without DOIs (why not just do what Dryad normally does?)
  • Dryad to decide (which should involve discussions with users) if it wants individual taxa in addition to the focal (higher-level) taxon. The individual taxa are useful, but there could be thousands if all are listed. (I don't see this as particularly an Open Tree thing; it's a problem common to any phylogeny deposit.)
  • JR and RS to flesh out alternative workflows and phasing for development of those (?)

Documentation

The word "package" is used in two senses here. The SWORD and DSpace sense is of a file, typically a .zip file, used for document interchange. The other sense is the Dryad sense of "data package". When I mean the latter I'll always say "data package".

Basic DSpace SWORD functionality

SWORD services are handled by the servlet defined by the dspace-sword packages. The first code to run is a copy of the Library of Congress SWORD library. For a deposit, the LoC code then hands off to some generic DSpace code for authentication, and then to a format-specific ingester, which could be SWORDMETSIngester or SWORDBagItIngester depending on the value of the X-Packaging: header in the POST request.

In order for SWORD to work on a port other than 80, you'll need to set dspace.baseUrl in dspace.cfg to contain the port, e.g.

dspace.baseUrl = http://localhost:9999

Basic test of SWORD functionality:

 wget http://localhost:9999/sword/servicedocument --user=USER --password=PASSWORD

Don't forget that USER will typically contain an @ ...

A package for SWORD purposes is typically a .zip file, although DSpace could in principle support other formats. The format is specified in the Content-Type: header of the POST request.

Here is a test of a METS (not BagIt) SWORD deposit. (The example METS package is File:Sword-article.zip, origin unknown.)

 wget --post-file=sword-article.zip \
  --header="Content-Type: application/zip" \
  --header="X-Packaging: http://purl.org/net/sword-types/METSDSpaceSIP" \
  --user=USER --password=PASSWORD \
  http://localhost:9999/sword/deposit/10255/3

I've tried doing this using 'curl' but it doesn't work for me.

The response to the POST is a 'deposit receipt', a bit of XML that describes the deposit. It contains an identifier for the deposit so that when we go to DSpace we can find it again. Currently the identifier is a handle. I don't know what would be involved in changing that to some other kind of identifier.
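
To make the 'identifier in the deposit receipt' idea concrete, here is a small Python sketch that pulls the identifier out of a receipt. It assumes the receipt is an Atom entry document whose id element carries the identifier, as in the SWORD 1.3 profile; the sample receipt and the handle in it are invented for illustration and are not real Dryad output.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# Invented sample: a real receipt contains more elements than this.
SAMPLE_RECEIPT = """<entry xmlns="http://www.w3.org/2005/Atom">
  <title>Example deposit</title>
  <id>http://hdl.handle.net/10255/dryad.EXAMPLE</id>
</entry>"""


def claim_check_from_receipt(receipt_xml):
    """Return the text of the Atom <id> element, or None if it is missing."""
    entry = ET.fromstring(receipt_xml)
    id_element = entry.find(ATOM + "id")
    if id_element is None or id_element.text is None:
        return None
    return id_element.text.strip()
```

If the identifier were later changed from a handle to something else, only the receipt contents would change; a client that treats the value as an opaque string would keep working.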

So far this has nothing to do with Dryad. The package is ingested as a single Item as if for DSpace, and it ends up in a workflow, not in a workspace.

Authentication

The LoC SWORD services assume the presence of a username and password in the HTTP headers. These values are parsed and passed through to DSpace. The actual authentication seems to happen in the last SWORDAuthenticator.authenticate() method definition. To switch to using API keys, the easiest hack would be to subvert the username value for use as an API key, and modify SWORDAuthenticator accordingly.
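
The API-key hack can be sketched as follows. This is Python pseudocode for the idea only: the real change would be made in the Java SWORDAuthenticator class, and the key table and function names here are hypothetical.

```python
import base64


def parse_basic_auth(header_value):
    """Split an HTTP Basic Authorization header value into (username, password)."""
    scheme, _, encoded = header_value.partition(" ")
    if scheme.lower() != "basic" or not encoded:
        raise ValueError("not a Basic Authorization header")
    decoded = base64.b64decode(encoded).decode("utf-8")
    username, _, password = decoded.partition(":")
    return username, password


def depositor_for(header_value, api_keys):
    """Treat the username field as an API key and ignore the password.

    api_keys is a hypothetical mapping from API key to the account that
    deposits are attributed to. Returns None when the key is unknown.
    """
    api_key, _unused_password = parse_basic_auth(header_value)
    return api_keys.get(api_key)
```

The appeal of this approach is that clients keep sending ordinary Basic credentials, so nothing in the LoC layer has to change.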

Dryad packages

Dryad packages are different from generic DSpace packages like METS in that they result in multiple Items, one (without bitstreams) for the data package and one for each data file. Dryad packages follow the 'bagit' layout inside the .zip file. A bagit package is built using the 'bag' utility from LoC. It has a number of files at the top level that don't affect us; all the content is in the data/ directory. A Dryad package has the following structure:

data/dryadpkg.xml    Metadata for the Dryad data package as a whole
data/dryadpub.xml    Metadata for the journal article (currently unused)
data/dryadfile-1/file1   The first data file
data/dryadfile-1/dryadfile-1.xml  Metadata for the first data file
data/dryadfile-2/...  Similarly

All the .xml files use the Dryad Metadata Application Profile version 3.1, with all fields optional. (The Bagit generator for treebase deposit uses version 3.0.)
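
The layout above can be assembled with nothing more than the Python standard library. The sketch below is an illustration under several assumptions: it writes only the two tag files the BagIt format requires (bagit.txt and an MD5 manifest), it places them at the root of the .zip, and it takes the metadata XML as ready-made strings rather than generating DMAP. Whether the Dryad ingester expects exactly this top-level arrangement has not been checked, and build_dryad_bag is a made-up name.

```python
import hashlib
import io
import zipfile


def build_dryad_bag(package_xml, data_files):
    """Return the bytes of a .zip laid out like the Dryad package structure above.

    data_files is a list of (filename, content_bytes, metadata_xml) tuples,
    one per data file. The XML strings are passed through untouched.
    """
    payload = {"data/dryadpkg.xml": package_xml.encode("utf-8")}
    for index, (name, content, metadata_xml) in enumerate(data_files, start=1):
        folder = f"data/dryadfile-{index}"
        payload[f"{folder}/{name}"] = content
        payload[f"{folder}/dryadfile-{index}.xml"] = metadata_xml.encode("utf-8")

    # BagIt payload manifest: one "<checksum>  <path>" line per payload file.
    manifest = "".join(
        f"{hashlib.md5(body).hexdigest()}  {path}\n"
        for path, body in sorted(payload.items())
    )

    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(
            "bagit.txt",
            "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n",
        )
        archive.writestr("manifest-md5.txt", manifest)
        for path, body in payload.items():
            archive.writestr(path, body)
    return buffer.getvalue()
```

The returned bytes can be written to a file and posted with wget as in the deposit example, or passed directly as the body of an HTTP POST.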

There is a class DryadBagItIngester that handles importing a Bagit package into Dryad.

Example Dryad/bagit deposit:

wget --post-file=dspace/modules/bagit/dspace-bagit-api/src/test/resources/2850-bagit.zip \
  --header="Content-Type: application/zip" \
  --header="X-Packaging: http://purl.org/net/sword-types/bagit" \
  --header="X-Verbose: true" \
  --header="X-No-Op: false" \
  --user=jar386@mumble.net --password=foo \
  http://localhost:9999/sword/deposit/10255/3

The example .zip comes from Open Tree.

Metadata sources

For Dryad package ingestion, metadata comes from three places:

  1. The creator of the package (e.g. Open Tree), via the dryadpkg.xml file
  2. Crossref, given a DOI provided by package creator
  3. Entered manually during the submission process

For number 1 there is a 'crosswalk' (XSLT transform) that converts DMAP to DIM (the DSpace Intermediate Metadata format) in the obvious way. The crosswalk is currently incomplete.

For number 3, the bagit package ingester leaves the data package in the submission workspace, so that the submission UI can be used to provide information like description and geographic location.
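
For number 2, the lookup could be done along the following lines. This sketch assumes the public Crossref REST API (api.crossref.org/works/ followed by the DOI), which returns JSON with the title as a list and the authors as given/family name pairs. It is not necessarily the interface the Dryad code calls, and the function names are invented for this example.

```python
import json
import urllib.parse


def crossref_works_url(doi):
    """Build the Crossref REST API URL for a DOI (assumed endpoint)."""
    return "https://api.crossref.org/works/" + urllib.parse.quote(doi, safe="/")


def bibliographic_fields(response_text):
    """Extract the title and author names from a Crossref works response body."""
    message = json.loads(response_text)["message"]
    titles = message.get("title") or [""]
    authors = [
        " ".join(part for part in (author.get("given"), author.get("family")) if part)
        for author in message.get("author", [])
    ]
    return {"title": titles[0], "authors": authors}
```

Fetching crossref_works_url(doi) with any HTTP client and passing the body to bibliographic_fields gives the values needed to prefill the 'Describe publication' step.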

Packages coming from Open Tree will usually, but not necessarily, specify that a CC0 waiver applies. They will usually, but not necessarily, provide a DOI for the publication. If it would help, these properties could be ensured on the Open Tree side.

Association with a user

An incoming data package is initially owned by whatever DSpace user was specified in the HTTP headers. Since the depositor doesn't (and shouldn't) know about all Dryad users or prompt for Dryad credentials, we need some way to transfer ownership of the data package to the right user. It is proposed to do this with a 'claim' system.

There will be a new Dryad HTTP service for 'claiming' a data package. When provided with some hard-to-guess token (e.g. the handle received in the SWORD deposit receipt, see above) that's been associated with the data package, the ownership of the data package is transferred to the user authenticated to Dryad for this service (in the usual way for authenticating a user in a session). If that user has already claimed the data package, no new action is taken. If some other user has claimed it, this should be considered an error. (If the submitter really wishes to switch Dryad user accounts for a submission, they will have to re-submit.)
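
The claim rules in the previous paragraph amount to a small state machine. The sketch below models them in Python with an in-memory table; ClaimRegistry and ClaimError are hypothetical names, and the real service would change ownership of the data package in DSpace rather than update a dictionary.

```python
class ClaimError(Exception):
    """Raised when a claim cannot be honored."""


class ClaimRegistry:
    """In-memory model of the proposed claim rules (illustration only)."""

    def __init__(self):
        self._owners = {}  # claim token -> claiming user, or None if unclaimed

    def register(self, token):
        """Record a newly deposited, not yet claimed, data package."""
        self._owners.setdefault(token, None)

    def claim(self, token, user):
        """Apply the claim rules for an authenticated user."""
        if token not in self._owners:
            raise ClaimError("unknown claim check")
        owner = self._owners[token]
        if owner is None:
            # First claim: ownership is transferred to this user.
            self._owners[token] = user
            return "claimed"
        if owner == user:
            # Same user again: no new action is taken.
            return "already claimed"
        # A different user has already claimed it: treated as an error.
        raise ClaimError("data package already claimed by another user")
```

Making a repeated claim by the same user a no-op means the emailed link stays usable if the user follows it more than once.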

This is not a situation where a high level of security is needed, so using a handle as a hard-to-guess token might be OK.

Once data package ownership is successfully transferred to the user, the submission sequence should be entered.

Entry into submission sequence

Once the data package is claimed, it will be an "unfinished submission" in Dryad. Currently an attempt to continue the submission, either from the 'claim' service or via the 'unfinished submissions' list, will enter at the page you see when you say you're all done entering data files. It would be better if one entered the submission sequence at the beginning, so that metadata can be checked, added, updated, etc.

See also