Open Tree / Dryad handshake

Revision as of 13:38, 28 March 2015

This page describes how a data package is deposited in Dryad from the Open Tree of Life UI, and gives the current status of the project.

Procedure

The user starts from the Open Tree curation webapp and navigates to the study to be deposited, where there is a 'deposit to Dryad' hyperlink.

  1. Open Tree displays a manifest showing what is going to be sent to Dryad.
    • NexML file
    • Study and tree metadata (publication DOI, ingroup/focal clade, method, etc)
    • The raw open tree submission files (e.g. Newick strings)
    • Supplementary files if any
  2. With the manifest display, there is a submit-like button to confirm transmission to Dryad
  3. User confirms. Following confirmation: 'Please wait while we prepare a package and send it...'
  4. Open Tree assembles a 'package' (a .zip file) for the study. The package is in BagIt format, similar to what is currently generated when Dryad does a Treebase deposit.
  5. Open Tree does a POST to Dryad using SWORD protocol. A client account/password is provided for authentication to Dryad.
  6. Dryad ingests the package and responds with a 'deposit receipt', a bit of XML, that contains a 'claim check'. (Currently the claim check is the handle assigned to the new data package, but it could be something else.)
  7. Open Tree says 'submission accepted' and offers a link to navigate to Dryad to complete the submission. The link includes the claim check as a way of identifying the submission. Open Tree also emails this link to the user, to help prevent its getting lost.
  8. The user follows the link for the claim check. Dryad asks user to log in, if not logged in already.
  9. By navigating to the package using the claim check, the package is 'claimed' by that user and becomes associated with his/her Dryad user account.
  10. Dryad submission sequence begins:
    1. User selects 'published'/'accepted'/'in prep' and confirms Dryad ToS and CC0. The DOI is prefilled from Open Tree.
    2. 'Describe publication'
      • (Publication reference info is prefilled with info from Crossref based on DOI.)
      • (Focal clade is prefilled.)
      • Submitter provides abstract, keywords, taxonomic names, geographic info, geologic eras
    3. 'Upload data files'
      • (Open Tree has already provided the files.)
      • Submitter changes / extends file title and description if desired.
      • Data file author list can be changed.
      • Addition of readme file(s) if desired.
      • (Usually no embargo)
    4. 'Review'
      • User can add more files if desired (back to step 3)
    5. 'Checkout & submit'
  11. When the submission is complete, Dryad asynchronously sends the data package DOI back to Open Tree using an Open Tree web service. Open Tree records the DOI as the "data deposit URI" associated with the study in Open Tree.
  12. (We may want a garbage collector for deposits left unclaimed for a long time.)
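
Step 6 of the procedure can be illustrated with a minimal Python sketch that pulls the claim check out of a deposit receipt. It assumes the receipt is an Atom entry whose id element carries the handle; the sample receipt and its handle are hypothetical, and a real Dryad receipt will contain more elements.

```python
import xml.etree.ElementTree as ET

ATOM_NS = "http://www.w3.org/2005/Atom"

# Hypothetical receipt, for illustration only; the handle is made up.
SAMPLE_RECEIPT = """<atom:entry xmlns:atom="http://www.w3.org/2005/Atom">
  <atom:id>http://hdl.handle.net/10255/dryad.12345</atom:id>
  <atom:title>Example study</atom:title>
</atom:entry>
"""


def claim_check_from_receipt(receipt_xml):
    """Return the text of the Atom <id> element, or None if it is absent."""
    root = ET.fromstring(receipt_xml)
    id_element = root.find("{%s}id" % ATOM_NS)
    if id_element is None or id_element.text is None:
        return None
    return id_element.text.strip()


print(claim_check_from_receipt(SAMPLE_RECEIPT))
```

If the claim check ends up being based on something other than the handle, only the lookup inside claim_check_from_receipt would need to change.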

Notes

As I read it, the Metadata Application Profile (DMAP) schema makes all of its defined fields mandatory. Open Tree will need a lenient variant of DMAP, since DMAP requires metadata fields for information that Open Tree doesn't have.

There are two versions of DMAP, 3.0 and 3.1. The Treebase handshake uses version 3.0, but we'll use version 3.1, which is simpler. Unfortunately that means packages generated for Treebase can't be parsed by the BagIt ingester used by Open Tree.

For the 'push' we will use SWORD protocol version 1.3, i.e. HTTP POST with an X-Packaging: request header. The current version of SWORD is 2.0 but (a) we don't need the features of 2.0 and (b) 2.0 is not implemented in Dryad.

Status as of 2015-03-17

Accomplished:

  • Completed survey of packaging and metadata format options: Deposit from Applications
  • Design, see above
  • Wrote a working python script that generates a BagIt package for an Open Tree study, given the study id, and submits it to Dryad using SWORD
  • Enabled the SWORD code in the Dryad code base (changes to Dryad in github fork)
  • Wrote a parser for Dryad-BagIt packages that creates Dryad data package and data file items
  • Submissions show up in the Dryad UI under 'unfinished submissions'
  • Enable modification to Dryad-BagIt submission using Dryad submission UI
    • Study how Dryad creates data packages so that their submission is 'resumable'
      • Figure out where data package creation happens
  • Metadata 'crosswalk' to bring metadata from Open Tree into Dryad (DOI, mainly)
    • Study other import crosswalks
    • Study the BagIt disseminator crosswalk (the inverse of what we need)
    • Learn enough XSLT to be able to write a crosswalk
    • Figure out where and how to invoke the crosswalk
    • Deal with incompatibilities with submission front end
  • Invoke Crossref API to fill in bibliographic metadata (title, authors, etc.) given the article DOI
  • Complete Open Tree's package creation script (aux files, tree metadata)
  • Choose between push and pull - answer is pull

Current:

  • Open Tree UI (with help from Jim Allman)
    • Enable invocation of package generation script from curation UI
    • What should the dummy account be for the SWORD deposit user/password? (Or make deposit open?)
  • Implement claim checks
    • What should the claim check be: based on workspace id, based on handle, or something else?
      • (SWORD deposit receipt contains handle, but submission UI wants workspace id)
    • If needed, service for any new URL required for activating the claim check
    • Exactly where and how to change ownership of the data package

Future:

  • Dryad to tell Open Tree about DOI (how does it know to do that, and how to do it? maybe this is not the right way to go)
  • Figure out weird bug with wrong submission step indicator in Dryad submission sequence
  • Rewrite everything for robustness, documentation, and tests

The timeline is impossible to predict, but we are targeting 20 May 2015 for launch (SSB workshop in Ann Arbor).

Open Questions

  • JR and RS to determine how to handle publications without DOIs (why not just do what Dryad normally does?)
  • Dryad to decide (which should involve discussions with users) if it wants individual taxa in addition to the focal (higher-level) taxon. The individual taxa are useful, but there could be thousands if all are listed. (I don't see this as particularly an Open Tree thing; it's a problem common to any phylogeny deposit.)
  • JR and RS to flesh out alternative workflows and phasing for development of those (?)

Documentation

Basic DSpace SWORD functionality

SWORD services are handled by the servlet defined by the dspace-sword packages. The first code invoked is a copy of the Library of Congress SWORD library. For a deposit, the LoC code hands off to some generic DSpace code for authentication, and then to a format-specific ingester, either SWORDMETSIngester or SWORDBagItIngester, depending on the X-Packaging: header in the POST request.

In order for SWORD to work on a port other than 80, you'll need to set dspace.baseUrl in dspace.cfg to contain the port, e.g.

 dspace.baseUrl = http://localhost:9999

Basic test of SWORD functionality:

 wget http://localhost:9999/sword/servicedocument --user=USER --password=PASSWORD

Don't forget that USER will typically contain an @ (DSpace logins are normally email addresses).
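
The same check can be scripted. The following is a minimal Python sketch, assuming HTTP Basic authentication as in the wget call above; the function name and the example credentials are hypothetical. Because the credentials travel in the Authorization header rather than in the URL, the @ in USER needs no escaping.

```python
import base64
import urllib.request


def service_document_request(base_url, user, password):
    """Build (but do not send) a GET request for the SWORD service document."""
    credentials = ("%s:%s" % (user, password)).encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    request = urllib.request.Request(base_url + "/sword/servicedocument")
    request.add_header("Authorization", "Basic " + token)
    return request


# Hypothetical credentials; substitute a real Dryad account.
request = service_document_request(
    "http://localhost:9999", "someone@example.org", "secret")

# To actually send it (requires a running server):
#   with urllib.request.urlopen(request) as response:
#       print(response.read().decode("utf-8"))
```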

A package for SWORD purposes is typically a .zip file, although DSpace could in principle support other formats. The format is specified in the Content-Type: header of the POST request.

Here is a test of a METS (not BagIt) SWORD deposit (an example METS package is File:Sword-article.zip):

 wget --post-file=sword-article.zip \
  --header="Content-Type: application/zip" \
  --header="X-Packaging: http://purl.org/net/sword-types/METSDSpaceSIP" \
  --user=USER --password=PASSWORD \
  http://localhost:9999/sword/deposit/10255/3

So far this has nothing to do with Dryad. The package is ingested as a single Item as if for DSpace, and it ends up in a workflow, not in a workspace.
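
For scripted deposits, the METS test above could be written in Python roughly as follows. This is a sketch under the same assumptions as the wget command (Basic authentication, a .zip body, the METSDSpaceSIP packaging URI); the function name is hypothetical, and this is not the Open Tree submission script itself.

```python
import base64
import urllib.request

METS_PACKAGING = "http://purl.org/net/sword-types/METSDSpaceSIP"


def deposit_request(deposit_url, package_bytes, user, password,
                    packaging=METS_PACKAGING):
    """Build (but do not send) a SWORD 1.3 deposit POST for a .zip package."""
    token = base64.b64encode(("%s:%s" % (user, password)).encode("utf-8"))
    request = urllib.request.Request(
        deposit_url, data=package_bytes, method="POST")
    # Content-Type names the file format; X-Packaging selects the ingester.
    request.add_header("Content-Type", "application/zip")
    request.add_header("X-Packaging", packaging)
    request.add_header("Authorization", "Basic " + token.decode("ascii"))
    return request


# To send a real package (requires a running server):
#   with open("sword-article.zip", "rb") as handle:
#       request = deposit_request(
#           "http://localhost:9999/sword/deposit/10255/3",
#           handle.read(), "USER", "PASSWORD")
#   with urllib.request.urlopen(request) as response:
#       receipt = response.read().decode("utf-8")
```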

Dryad packages

Dryad packages differ from generic DSpace packages like METS in that they result in multiple Items: one (without bitstreams) for the data package and one for each data file. Dryad packages follow the BagIt layout inside the .zip file. A BagIt package is built using the 'bag' utility from LoC. The top level holds BagIt bookkeeping files that don't affect us; all the content is in the data/ directory. A Dryad package has the following structure:

 data/dryadpkg.xml                 Metadata for the Dryad data package as a whole
 data/dryadpub.xml                 Metadata for the journal article (currently unused)
 data/dryadfile-1/file1.xml        The first data file
 data/dryadfile-1/dryadfile-1.xml  Metadata for the first data file
 data/dryadfile-2/...              Similarly
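
The layout above can be assembled with Python's zipfile module, as in the following sketch. It is not the 'bag' utility and not the Open Tree script: the function name is hypothetical, the entries are assumed to sit at the root of the archive, the only top-level files written are bagit.txt and an MD5 manifest, and dryadpub.xml is omitted because it is currently unused.

```python
import hashlib
import io
import zipfile


def build_dryad_bag(package_xml, files):
    """Return the bytes of a .zip holding a minimal BagIt-style package.

    package_xml: bytes for data/dryadpkg.xml
    files: list of (file_name, file_bytes, metadata_xml_bytes) tuples
    """
    payload = {"data/dryadpkg.xml": package_xml}
    for index, (name, content, metadata) in enumerate(files, start=1):
        folder = "data/dryadfile-%d" % index
        payload["%s/%s" % (folder, name)] = content
        payload["%s/dryadfile-%d.xml" % (folder, index)] = metadata

    # One "checksum  path" line per payload file.
    manifest = "".join(
        "%s  %s\n" % (hashlib.md5(content).hexdigest(), path)
        for path, content in sorted(payload.items())
    )

    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr(
            "bagit.txt",
            "BagIt-Version: 0.97\nTag-File-Character-Encoding: UTF-8\n")
        archive.writestr("manifest-md5.txt", manifest)
        for path, content in sorted(payload.items()):
            archive.writestr(path, content)
    return buffer.getvalue()


# Placeholder contents; real packages carry DMAP 3.1 metadata.
blob = build_dryad_bag(b"<pkg/>", [("tree.nex", b"#NEXUS", b"<file/>")])
print(zipfile.ZipFile(io.BytesIO(blob)).namelist())
```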

All the .xml files use the Metadata Application Profile version 3.1. (The BagIt generator for Treebase deposit uses version 3.0.)

A class, DryadBagItIngester, handles importing a BagIt package into Dryad.

Metadata crosswalk

For Dryad package ingestion, metadata comes from three places:

  1. The creator of the package (e.g. Open Tree), via the dryadpkg.xml file
  2. Crossref, given a DOI provided by the package creator
  3. Entered manually during the submission process

For number 1, there is a 'crosswalk' (an XSLT transform) that converts DMAP to DIM (DSpace Intermediate Metadata) in the obvious way; it is currently incomplete.

For number 3, the BagIt package ingester leaves the data package in the submission workspace, so that the submission UI can be used to provide information like description and geographic location.
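
For number 2, a lookup might look like the following Python sketch. This page does not say which Crossref interface Dryad calls, so the public REST endpoint (api.crossref.org/works/DOI) is an assumption, as are the function names; the sample response is hypothetical and heavily trimmed.

```python
import json
import urllib.parse


def crossref_url(doi):
    """Return the Crossref REST API URL for a DOI (assumed endpoint)."""
    return "https://api.crossref.org/works/" + urllib.parse.quote(doi, safe="/")


def bibliographic_fields(response_text):
    """Extract the title and author names from a Crossref 'works' response."""
    message = json.loads(response_text)["message"]
    titles = message.get("title", [])
    authors = [
        " ".join(part for part in (a.get("given"), a.get("family")) if part)
        for a in message.get("author", [])
    ]
    return {"title": titles[0] if titles else None, "authors": authors}


# Hypothetical response used only to exercise the parser.
SAMPLE = json.dumps({"message": {
    "DOI": "10.1234/example",
    "title": ["An example phylogeny"],
    "author": [{"given": "Ada", "family": "Lovelace"},
               {"family": "Consortium"}],
}})

print(crossref_url("10.1234/example"))
print(bibliographic_fields(SAMPLE))
```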

See also