Difference between revisions of "TreeBASE Submission Integration"

From Dryad wiki

Revision as of 18:54, 13 October 2010

Status: Initial implementation is complete, but it will continue to be enhanced. (This page needs to be updated with current implementation details)

Overview

Authors who submit content to TreeBASE or Dryad will have the option to make their content appear in both systems.

TreeBASE content will be searchable through Dryad, even if the author has not explicitly included the content in a Dryad data package.

This integration will be based on the following technologies:

  • BagIt -- A lightweight format for packaging digital content and ensuring that it is transferred intact.
  • OAI-PMH -- A protocol developed by the digital library community to allow harvesting of metadata from remote repositories.

We are evaluating the SWORD protocol to manage the transfer of BagIt packages, but we have not yet determined whether SWORD will be lightweight enough to justify its use.
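
As a rough illustration of the OAI-PMH harvesting mentioned above, the following Python sketch builds a ListRecords request URL. The base URL used in the usage note is a placeholder rather than the real TreeBASE endpoint, and the function name is hypothetical. The "oai_dc" metadata format is the one every OAI-PMH repository is required to support.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None, set_spec=None):
    """Build an OAI-PMH ListRecords request URL (hypothetical helper)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        # Optional selective harvesting: only records changed since this date.
        params["from"] = from_date
    if set_spec:
        # Optional restriction to one set defined by the repository.
        params["set"] = set_spec
    return base_url + "?" + urlencode(params)
```

For example, list_records_url("http://example.org/oai", from_date="2010-10-01") asks the placeholder repository for every record changed since 1 October 2010.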

Use cases

User submits to Dryad first

  1. User submits Nexus to Dryad, and pushes "send to TreeBASE" button.
    • Button says "I want to deposit my tree(s) in TreeBASE and enhance the description there. I realize that any annotations I create in TreeBASE will also be released under CC0." [is this consistent with the terms of reuse for TB?--Tjvision 07:26, 29 July 2010 (EDT)]
  2. Dryad pushes object to TreeBASE. (This is before the object is curated in Dryad)
    1. citation data and all uploaded nexus files are packed into a BAGIT package and pushed onto TreeBASE
    2. TreeBASE has a PUT RESTful service for receiving data (later this may be reimplemented as a SWORD service)
    3. TreeBASE only accepts the PUT if the sender's IP is within the Dryad range
    4. TreeBASE responds by returning a URL
  3. Dryad emails the user to confirm "your content was forwarded to TreeBASE". The email includes the link, saying "click on this to finalize your submission in TreeBASE". (If the link can be received quickly enough, it is also displayed within the Dryad interface).
  4. User goes to TreeBASE and completes record.
    1. Clicking on the link takes the user to a special log-in page in TreeBASE; upon logging in, TreeBASE is triggered to unpack the BAGIT and create a submission based on the contents
  5. Dryad harvests TreeBASE content.
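
The packing in step 2.1 can be sketched as follows. This is a minimal illustration of the BagIt layout (a bagit.txt declaration, a payload under data/, an MD5 manifest and a bag-info.txt), not the code Dryad actually runs. The function names, the version string and the choice of the Contact-Email field for the submitter's address are assumptions.

```python
import hashlib
from pathlib import Path


def write_bag(bag_dir, payload, contact_email):
    """Write a minimal BagIt bag; payload maps file names to bytes."""
    bag = Path(bag_dir)
    data = bag / "data"
    data.mkdir(parents=True, exist_ok=True)
    manifest_lines = []
    for name, content in sorted(payload.items()):
        (data / name).write_bytes(content)
        digest = hashlib.md5(content).hexdigest()
        manifest_lines.append(f"{digest}  data/{name}")
    # Bag declaration; the version string here is an assumption.
    (bag / "bagit.txt").write_text(
        "BagIt-Version: 0.96\nTag-File-Character-Encoding: UTF-8\n", encoding="utf-8")
    (bag / "manifest-md5.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8")
    # Assumption: the submitter's address goes in the Contact-Email field.
    (bag / "bag-info.txt").write_text(
        f"Contact-Email: {contact_email}\n", encoding="utf-8")
    return bag


def verify_bag(bag_dir):
    """Return True if every payload file matches its manifest checksum."""
    bag = Path(bag_dir)
    manifest = (bag / "manifest-md5.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        digest, rel_path = line.split(None, 1)
        if hashlib.md5((bag / rel_path).read_bytes()).hexdigest() != digest:
            return False
    return True
```

The receiving side can call verify_bag before unpacking, which is how BagIt ensures that the content was transferred intact.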

User submits to TreeBASE first, links from Dryad record

  1. User submits/edits package in Dryad and includes a TB ID
  2. Is it already in Dryad (via harvest)?
    • If so, create internal links in Dryad
    • If not, ask the user for their access code. [why is that needed?--Tjvision 07:24, 29 July 2010 (EDT)]
    • If that doesn't work, tell them to just upload their nexus file as a separate data file
  3. Dryad adds TB ID to the record, and TB will be able to check up on it.
  4. Also add an alert for Dryad curators to follow up.

User submits to TreeBASE, Dryad harvests only

  • Treat as any other harvested content (second-class)

Process for completing a submission within TreeBASE

Minimum Requirements

  • nexus file
    • at least one tree OR at least one matrix
    • if there is a tree and a matrix, the taxon labels must match up.
    • must be "understood" by Mesquite
  • citation
  • analysis info linking matrices and trees
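
The tree/matrix rule above could be checked along these lines. This sketch assumes the taxon labels have already been extracted from the Nexus file (parsing is not shown) and that "match up" means the two label sets are identical. Both function names are hypothetical.

```python
def label_mismatches(matrix_labels, tree_labels):
    """Report labels that appear on only one side."""
    matrix_set, tree_set = set(matrix_labels), set(tree_labels)
    return {
        "only_in_matrix": sorted(matrix_set - tree_set),
        "only_in_tree": sorted(tree_set - matrix_set),
    }


def meets_minimum(matrix_labels, tree_labels):
    """At least one tree OR one matrix; if both are present, labels must agree."""
    has_matrix, has_tree = bool(matrix_labels), bool(tree_labels)
    if not (has_matrix or has_tree):
        return False
    if has_matrix and has_tree:
        diff = label_mismatches(matrix_labels, tree_labels)
        return not diff["only_in_matrix"] and not diff["only_in_tree"]
    return True
```

Returning the mismatched labels, rather than just a boolean, would let the submission form tell the user which names to fix.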

Detailed Process

  • create account
  • login
  • create new submission
  • type title
    • the submission gets a PURL at this point
    • the PURL can have a code added for reviewer access
  • fill in citation
    • minimum: year, title, journal name (or book/section title)
    • journal names auto-suggest as you type
  • add authors
    • minimum: at least one author (with first name and last name)
    • must always search for an existing author first, even if you know they're not in the system
    • allows reordering or deleting authors while you're in the process
  • upload file(s)
    • minimum: must be nexus, as described above
  • (optional) add notes
    • this is a textarea, with a reasonable character limit (not enough for a readme file)
  • (optional) edit details for matrices
  • (optional) edit row segment template
    • minimum: row ID, start index, end index
  • (optional) provide more details for trees
  • (optional) taxa
    • match all named taxa against ubio or ncbi
    • although the cleanup is optional, the TB editor may reject it if it's not cleaned up
  • analysis
    • minimum: create an analysis with at least one step. Typically, this will be a matrix that is processed to create one or more trees.
    • minimum: otu labels must match in the analysis steps
  • when initial submission complete, user clicks "change to ready state"
    • this triggers the curator to look at it
    • user can leave items as "in progress" as long as they want -- this is a "poor man's embargo" system
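
The "minimum" items in the list above can be collected into a single completeness check. The sketch below is only an illustration: the dictionary layout and field names are hypothetical and do not reflect TreeBASE's actual data model. The "journal" field stands for the journal name or the book/section title.

```python
def missing_minimums(submission):
    """Return the minimum requirements that a submission does not yet meet."""
    missing = []
    citation = submission.get("citation", {})
    for field in ("year", "title", "journal"):
        if not citation.get(field):
            missing.append("citation." + field)
    # At least one author with both a first and a last name.
    authors = submission.get("authors", [])
    if not any(a.get("first_name") and a.get("last_name") for a in authors):
        missing.append("authors")
    if not submission.get("nexus_files"):
        missing.append("nexus_files")
    # At least one analysis with at least one step.
    if not any(an.get("steps") for an in submission.get("analyses", [])):
        missing.append("analyses")
    return missing
```

An empty result would mean the submission can be changed to the ready state. A non-empty one lists what is still "in progress".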

Open Questions

  1. Can Dryad records be transferred immediately, or must they be approved by a Dryad curator first? If records are transferred before curator approval, when is the permanent ID assigned?
  2. Is it possible to carry over authentication? (single sign-on) Can/should Dryad track user account info on other systems? (or will everyone move to DataONE authentication?)
  3. Does the user have to press a button to submit to TreeBASE, or could it just be automatic? If we could link the user accounts, the submission could just show up when the user logs into TreeBASE.
  4. Should TreeBASE have a "pull" method, where users logged in to TreeBASE can import content with a Dryad ID?


Random Notes

  • TB does not make content available until the associated article is published
  • (new) TB only has one identifier, which is used all the way through the process
  • TB has thousands of in-progress submissions, which are waiting for the publication to be accepted.
  • Dryad often knows that an article has been accepted, and should tell TB about this
  • TB may have an embargo process, which Dryad should use for embargoed items

(old) Workflow

NOTE: This section is outdated, and needs to be cleaned up. More details are available in the general Handshaking pages.

Whiteboard notes from the initial discussion, including integration with Dryad submissions.
  1. User submits to Dryad (and completes the submission).
  2. User is presented with a button "Also submit this content to TreeBASE"
  3. When the button is pressed, all relevant Dryad data/metadata is forwarded to TreeBASE as a SWORD package (publication becomes a TreeBASE study, each tree & matrix becomes TreeBASE data).
  4. Items are in the TreeBASE submission system, waiting for the user to finish. The user can login to TreeBASE at any time and complete the submission, adding additional information as necessary. (Or they may ignore it)
  5. When TreeBASE submission is complete Dryad picks up the submission in its next OAI-PMH harvest (from the TreeBASE OAI Provider).
  6. Dryad matches the items to existing Dryad records. Typically the matching will rely on Dryad handles being present in the records that TreeBASE serves via OAI, but matching may also rely on publication DOI, titles, or other metadata.
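
The matching in step 6 might look roughly like this: try the Dryad handle first, then fall back to the publication DOI, then to a normalized title. The record layout and field names are hypothetical, and a real matcher would probably need fuzzier title comparison.

```python
def normalize_title(title):
    """Lower-case and collapse whitespace so trivial differences do not block a match."""
    return " ".join(title.lower().split())


# Match keys in order of preference, each with its own normalizer.
MATCH_KEYS = (
    ("dryad_handle", str.strip),
    ("doi", str.lower),
    ("title", normalize_title),
)


def match_record(harvested, dryad_records):
    """Return (matching Dryad record, key that matched), or (None, None)."""
    for key, normalize in MATCH_KEYS:
        value = harvested.get(key)
        if not value:
            continue
        wanted = normalize(value)
        for record in dryad_records:
            candidate = record.get(key)
            if candidate and normalize(candidate) == wanted:
                return record, key
    return None, None
```

Returning the key that produced the match lets curators treat a title-only match with more suspicion than a handle match.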

Relevant Text from the Grant Proposal

  • "[handshaking] so that, where required by the journal or requested by the author, data will simultaneously be deposited in Dryad and... TreeBASE."
  • "Dryad will collect any metadata required by the target database that has not already been captured, submit the pertinent data to the target database using a non-interactive programmatic gateway, and obtain the submission status, accession numbers, or possible error messages from the target database."
  • "For TreeBASE, we will design and implement a robust, web-service based submission Application Programming Interface (API). An extensive redesign of TreeBASE by the CIPRES project (www.phylo.org) is scheduled for release in 2007. However, it currently lacks a submission API. The software to be added will include the automated data validation steps that are part of the new TreeBASE submission process (e.g. validating the NEXUS format, matching terminal taxa against the uBio NameBank). When TreeBASE rejects a submission, the depositor will be notified, advised how to correct the problem, and asked to resubmit. "

Progress with Handshaking (as of 7-22-10)

The TreeBASE side of the handshaking is complete. The procedure is as follows:

  1. Dryad offers a "submit to TreeBASE" button for Dryad submissions that have phylogenetic data
  2. Clicking on the button causes Dryad to send a BAGIT package to TreeBASE via a REST service: "http://www.treebase.org/treebase-web//handshaking/dryadImport"
  3. The REST service stores the BAGIT package and returns a unique URL as a string, resembling: "~/treebase-web/login.jsp?importKey=123456789"
  4. Dryad presents this URL as a link to the user and asks the user to follow it so as to finish the submission on the TreeBASE website
  5. Whoever follows the URL takes ownership of the data. The user is presented with a login page, which includes the option of creating a new account. Once the user has logged in, the BAGIT is unpacked and its contents are parked under a new submission created in the user's account. If a new account has to be created first, this works only so long as the user remains within the same session (i.e. does not start a new browser).
  6. The user can open the new submission and continue filling in information. The new submission should have citation information already filled in and any NEXUS files in the BAGIT will have been parsed and stored.

Remaining Issues/Concerns:

  1. IP-based authentication needs to be configured on the Apache server so that all requests to "~/treebase-web//handshaking/dryadImport" are blocked except for those coming from Dryad machines.
  2. TreeBASE normally requires that author data include email addresses. Since email addresses are not supported in the XML data included in the BAGIT, this information is missing from the citation entry created for the user. Ideally we would like Dryad to supply email addresses for all authors. The alternative is to ask users to delete the pre-loaded author names and reenter the authors with email addresses. [Update] Dryad now sends the submitter's email in the bagit package. This isn't included in the metadata for the records but in the bag-info.txt file, which is a part of the bagit format.
  3. TreeBASE normally prefers to treat Author records as a "one" table -- i.e. each author gets only one record. This feature can only be implemented for the handshaking if email addresses are supplied by Dryad (assuming that email addresses can uniquely identify people), since first + last names are not sufficiently unique to allow the software to pick from existing matches.
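
The IP restriction in item 1 is meant to be configured in Apache. The Python sketch below only illustrates the underlying check. The network range is a reserved documentation range standing in for the real Dryad addresses, which are not given here, and the function name is hypothetical.

```python
import ipaddress

# Placeholder range (reserved for documentation), NOT the real Dryad range.
ALLOWED_NETWORKS = [ipaddress.ip_network("192.0.2.0/24")]


def is_allowed(remote_addr, networks=ALLOWED_NETWORKS):
    """Return True if the sender's address falls inside an allowed network."""
    ip = ipaddress.ip_address(remote_addr)
    return any(ip in network for network in networks)
```

Requests to the import service from any address for which is_allowed returns False would be rejected before the bag is stored.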

Progress with Handshaking (as of 10-03-10)

There is now a script that exports Dryad data packages and requested data files into the bagit format and sends the package to the TreeBASE web service. The script is implemented as a DSpace package disseminator and can be run through the standard DSpace disseminator script. For example:

/opt/dryad/bin/dspace packager -d -i 10255/dryad.630 -e submitter@email.org -o xwalk=DRYAD-V3 -o repo=TREEBASE -o 'files=10255/dryad.631;10255/dryad.632;10255/dryad.633;10255/dryad.634' -t BAGIT -

/opt/dryad/bin/dspace = the standard DSpace interface for calling scripts from the command line; this must be run as the same user that runs the Tomcat instance serving DSpace.

packager -d = the specific DSpace packager script; the -d tells it to run the packager to disseminate (there isn't a bagit ingester yet, but -i would be used if there were).

-i 10255/dryad.630 = the Dryad data package's pseudo-handle (that DSpace uses as an internal identifier); this must be the pseudo-handle of a data package, not a data file.

-e submitter@email.org = the email address of the person who is making the submission to Dryad; we have this because someone must be logged in in order to submit to Dryad.

-o xwalk=DRYAD-V3 = this is a hard-coded value at the moment (will always be the same); it says to export the metadata stored in DSpace in the Dryad Application Metadata Profile format (version 3 is the latest version).

-o repo=TREEBASE = this is also hard-coded for TreeBASE; repo just indicates the remote repository to which the bagit file should be uploaded (in the future, Dryad may send BagIt files to other repositories as well).

-o files=10255/dryad.631;10255/dryad.632;10255/dryad.633;10255/dryad.634 = the list of data files, associated with the data package, that the submitter wants exported to TreeBASE; it may not be all the data files, but all the files of a particular type (for instance, in the case of TreeBASE, Nexus files). For calling the script from the command line, this argument will probably have to be enclosed in single quotes (see the example above).

-t BAGIT = the package disseminator that should be used to export the files; in the case of TreeBASE, this is always hard-coded to BAGIT because that is the format that the TreeBASE importer supports.

- = This indicates that output should be sent to the command line; currently, the script follows the UNIX philosophy that 'no response' is an indicator of success (optionally, the code can be uncommented to allow the output of 'success' on a successful export). If there is a problem, an exception will be thrown and an email with details will be sent to the address configured as the DSpace admin.
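
When the packager is called from another program rather than typed at a shell, the arguments described above can be assembled as a list, which avoids the quoting problem with the semicolon-separated files option. This is a sketch only: the function name is hypothetical, and the values are the ones from the example command.

```python
import shlex


def packager_args(package_handle, submitter_email, file_handles,
                  dspace_bin="/opt/dryad/bin/dspace"):
    """Assemble the argument list for the BAGIT disseminator call."""
    return [
        dspace_bin, "packager", "-d",
        "-i", package_handle,
        "-e", submitter_email,
        "-o", "xwalk=DRYAD-V3",
        "-o", "repo=TREEBASE",
        "-o", "files=" + ";".join(file_handles),
        "-t", "BAGIT",
        "-",
    ]


args = packager_args(
    "10255/dryad.630",
    "submitter@email.org",
    ["10255/dryad.631", "10255/dryad.632"],
)
# shlex.join adds the single quotes that a shell would need around files=...
print(shlex.join(args))
```

The list can be passed directly to subprocess.run(args, check=True), in which case no shell is involved and no quoting is needed. Running it requires a DSpace installation and the correct user account.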

See also: TreeBASE OAI Provider