INSDC Data Submission Integration

From Dryad wiki
Status: The initial Dryad-side implementation of a Genbank-specific handshaking mechanism is complete. It is hidden from users because the corresponding part on the Genbank side was never implemented by NCBI. The design is currently being rethought and revised. In particular, we are broadening the scope to be more compatible with other INSDC repositories and to support emerging minimum reporting standards for metadata, along with community tools for annotating data according to those standards.

This page describes how Dryad is approaching integration of its own submission process with data submission to molecular data repositories, specifically the INSDC sequence databases (NCBI's Genbank, EMBL's ENA, and DDBJ), whether through handshaking protocols or through integration with cross-repository submission and metadata annotation tools. Also relevant in this context is NCBI LinkOut support.

Objectives

We seek to accomplish the following overall objectives by integrating with INSDC member repositories.

  1. Benefit a global community of authors and users.
  2. Promote adoption of and compliance with applicable standards for data and metadata quality, in particular minimum metadata reporting standards.
  3. Provide tangible benefits to Dryad stakeholders, in particular journals and users. Users include depositors, who typically are also journal authors, as well as users trying to find or access data.

Targets and Plan

To meet these objectives, our integration efforts include the following aims and targets.

  1. Improving discovery of related data. Different types of data from the same published study often end up in different repositories, due to existing data archiving requirements for specialized data types, in particular for sequencing data. For a user, discovering and connecting all the data related in this way to a record in one repository can be quite involved. We therefore target link-out mechanisms supported by other repositories, in particular NCBI's LinkOut, which can connect PubMed and Genbank records to their corresponding data packages and files, respectively, in Dryad. As many biological and biomedical databases link their records to published articles only by PubMed ID and not by DOI, we also aim to harvest PubMed IDs from NCBI for articles with data in Dryad, with the goal of allowing third-party biomedical databases that lack a link-out mechanism to link easily to the Dryad records related to their own.
  2. Integrating with community resources for metadata annotation of sequence data according to pertinent minimum reporting standards such as the MIxS family of standards for metagenomes. As bibliographic and data authorship metadata are part of such standards and are already collected by Dryad, an integration that automatically transfers such metadata to the annotation tool benefits authors while simultaneously promoting metadata best practices during the pre-archival stages of the data lifecycle. To focus on annotation tools backed by communities of practice, we aim to collaborate with the Genomic Standards Consortium (GSC) and BioSharing communities and to integrate with the annotation tools they endorse, such as the ISA tools suite.
  3. Achieving benefits for users of as many INSDC member repositories as possible, and of more than one at a minimum. Submission handshaking protocols that depend on repository-specific tools will only benefit authors submitting to that specific repository. Current standalone tools are all repository- and datatype-specific (such as BankIt for Genbank), and support programmatic interaction in idiosyncratic ways or not at all. The main potential tangible benefit for Dryad depositors is less redundancy in providing information, with bibliographic metadata as the primary target. Consequently, we are targeting submission portals for integration that allow API-based or file-based upload of bibliographic and data authorship metadata. Conversely, metadata specific to the sequence data is needed in much greater detail by the sequence repositories than what Dryad collects, and would thus be a useful target for harvesting by Dryad after the data are published.
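
As an illustration of the PubMed ID harvesting mentioned in the first target, the following minimal Python sketch looks up a PubMed ID for an article DOI. This page does not say how harvesting would be done; the sketch assumes NCBI's E-utilities ESearch service and the PubMed [AID] (article identifier) search field. The function names, the DOI, and the PubMed ID in the sample response are made up for illustration.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# NCBI E-utilities ESearch endpoint (assumed here as the lookup service).
ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url_for_doi(article_doi):
    """Build an ESearch URL that searches PubMed for an article DOI."""
    term = f"{article_doi}[AID]"  # [AID] restricts the search to article identifiers
    return f"{ESEARCH}?{urlencode({'db': 'pubmed', 'term': term})}"


def pubmed_ids(esearch_xml):
    """Extract the PubMed IDs listed in an ESearch XML response."""
    root = ET.fromstring(esearch_xml)
    return [el.text for el in root.findall("./IdList/Id")]


# Hand-written sample response with a made-up PubMed ID; no network call is made.
SAMPLE_RESPONSE = (
    "<eSearchResult><Count>1</Count>"
    "<IdList><Id>12345678</Id></IdList></eSearchResult>"
)

if __name__ == "__main__":
    print(esearch_url_for_doi("10.1000/example"))
    print(pubmed_ids(SAMPLE_RESPONSE))
```

A harvester along these lines would store the returned ID next to the Dryad data package only when exactly one match is found, and leave ambiguous or empty results for manual review.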

Design and Implementation

Integration with NCBI's Data Submission Portal

Status: On hold until the NCBI Data Submission Portal supports sequences and metagenomes, and at a minimum allows upload of bibliographic metadata through files or another automated route (the BioSample and BioProject interfaces for describing metagenomics samples currently do not). No actual implementation has taken place so far.

NCBI is in the process of consolidating its data submission tools under a unified portal, the NCBI Data Submission Portal.

  • The portal is in active development, and only a few submission types are presently (July 2012) functional (specifically, the BioProject, BioSample, WGS, dbGaP, and GTR submission wizards). Genbank sequence submission continues to redirect to the previously existing systems, BankIt and Sequin.
  • Within those submission types that the portal already supports, support for non-interactive data and metadata upload is being added in steps. While some metadata for BioSample and BioProject can be uploaded in bulk from tabular or XML formats (including MIxS metadata), all bibliographic metadata must be added manually in painstaking detail, except for those publications for which a PubMed ID is available (which would not be the case for authors depositing into Dryad at or before the time of manuscript acceptance).

Thus, at this time integrating with NCBI's Data Submission Portal does not offer notable advantages over integrating directly with BankIt: for what is likely the most relevant datatype (sequence data) we would still need to integrate with BankIt anyway, and the Portal does not take advantage of the bibliographic metadata already known to Dryad. Nor does it automatically create a connection between the Dryad record and the deposition record in the respective NCBI database.

Further material:

Genbank handshaking through BankIt

Status: The Dryad part of the handshaking protocol is implemented, but the Genbank part is not. Further implementation or design is currently on hold, because the devised mechanism is not well aligned with the objectives stated above.

Once complete, users would be able to use Dryad as a starting point for depositing genetic sequences in GenBank.

Key benefits:

  • Dryad's description of a publication will automatically be used to populate the BankIt tool in GenBank, saving authors time.
  • Links will be maintained between the Dryad data package and associated submissions to GenBank, allowing data re-users to easily find all content associated with a publication.

Technical documentation

The handshaking protocol was devised in collaboration with Kamen Todorov and Ilene Mizrachi of NCBI, and consists of parts to be implemented on the Dryad side and parts to be implemented on the NCBI / Genbank side. The latter gives NCBI systems a secure way to access the bibliographic metadata recorded at Dryad for the corresponding article, as well as the link to the data package in Dryad.

The Dryad end of the protocol has been implemented since July 2011 (by @mire), and technical documentation for it is available at atmire's wiki.

The NCBI / Genbank side (see below) has not been implemented yet.

Handshaking protocol design

Dryad will allow submitters easy access to GenBank via the BankIt tool. This allows submitters to work directly with GenBank staff if any issues arise, while allowing Dryad and GenBank to make the submission as seamless as possible.

Current workflow design:

  1. In the Dryad submission system, there will be a button that allows users to initiate a BankIt session.
  2. When the button is pressed, Dryad will open a new window to a BankIt session (leaving the Dryad submission system open). The BankIt URL will look like http://www.ncbi.nlm.nih.gov/WebSub/?tool=genbank&dryadID=PROVISIONAL_DOI&ticket=AUTH_TOKEN
    1. TODO: GenBank implement handling for this URL
  3. BankIt verifies the ticket with Dryad, and receives an OK.
    1. The verification URL will have the form http://datadryad.org/validate?token=AUTH_TOKEN
    2. Possible HTTP responses to the verification URL are 200 OK or 403 FORBIDDEN. If 200 OK is received, the returned document will contain the Dryad metadata. If 403 FORBIDDEN is received, GenBank will redirect the user to http://datadryad.org/genbankHandshakeError
  4. BankIt will create an internal cookie, so that GenBank can track the Dryad ID within its system.
  5. BankIt will pull the publication metadata from Dryad. This can be performed using the validation URL (above), or by using a URL of the form http://www.datadryad.org/mn/object/PROVISIONAL_DOI/dap
    1. For example: http://www.datadryad.org/mn/object/doi:10.5061/dryad.1705/dap
    2. The response will be XML that follows the format of the Dryad Application Profile.
    3. TODO: GenBank implement download of metadata
  6. The author completes the submission through BankIt.
  7. BankIt notifies Dryad of the accession numbers.
    1. TODO: How? Ryan discuss with Eugene
    2. One possibility would be for Dryad to query GenBank, and search for available items with Dryad identifiers (modified after a certain time)
  8. (up for discussion) How should Dryad send Genbank a notice when final publication info is available?
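
The URL and status-code conventions in steps 2, 3, and 5 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the Dryad or NCBI implementation: the function names, the in-memory ticket store, and the placeholder metadata document are hypothetical, and only the URL patterns and the 200/403 behaviour come from the workflow above.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# URL patterns taken from the workflow description.
BANKIT_BASE = "http://www.ncbi.nlm.nih.gov/WebSub/"
VALIDATE_BASE = "http://datadryad.org/validate"
ERROR_URL = "http://datadryad.org/genbankHandshakeError"
METADATA_PATTERN = "http://www.datadryad.org/mn/object/{doi}/dap"


def bankit_launch_url(provisional_doi, token):
    """Step 2: URL Dryad opens in a new window (values are percent-encoded)."""
    query = urlencode({"tool": "genbank", "dryadID": provisional_doi, "ticket": token})
    return f"{BANKIT_BASE}?{query}"


def parse_launch_url(url):
    """BankIt side of step 2: recover the Dryad ID and ticket from the URL."""
    params = parse_qs(urlsplit(url).query)
    return params["dryadID"][0], params["ticket"][0]


def validation_url(token):
    """Step 3.1: URL BankIt calls to verify the ticket."""
    return f"{VALIDATE_BASE}?{urlencode({'token': token})}"


def metadata_url(provisional_doi):
    """Step 5: alternative URL for fetching the publication metadata."""
    return METADATA_PATTERN.format(doi=provisional_doi)


def validate_ticket(token, issued_tickets):
    """Dryad side of step 3.2: return (HTTP status, response body).

    issued_tickets is a hypothetical dict mapping each valid token to the
    metadata XML of its data package.
    """
    if token in issued_tickets:
        return 200, issued_tickets[token]
    return 403, ""


def handle_validation(status, body):
    """BankIt side of step 3.2: continue with the metadata, or redirect."""
    if status == 200:
        return "continue", body
    return "redirect", ERROR_URL


if __name__ == "__main__":
    tickets = {"abc123": "<metadata/>"}  # placeholder, not real Dryad XML
    url = bankit_launch_url("doi:10.5061/dryad.1705", "abc123")
    dryad_id, ticket = parse_launch_url(url)
    print(handle_validation(*validate_ticket(ticket, tickets)))
    print(metadata_url(dryad_id))
```

Keeping the ticket check and the response handling as separate functions mirrors the split of the protocol into a Dryad part and an NCBI / Genbank part, so each side can be exercised without the other being available.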

Open questions:

Contacts:

  • BankIt - Vasuki Gobu, Kamen Todorov
  • Authentication - Vladimir Soussov
  • Submission system - Eugene Yaschenko
  • Everything else - Karl Sirotkin