NCBI Linkout Technology

NCBI LinkOut is the mechanism by which NCBI pages link to content in Dryad. Dryad data packages are related to articles in PubMed, and to molecular sequences in GenBank's nucleotide and protein databases.

Overview

Linking out from PubMed: Every article in PubMed for which there are data in Dryad should have a link out to its respective Dryad data package. Initially, this will be restricted to those PubMed articles that also have a DOI (which should be the vast majority of PubMed articles that have a Dryad data package).

Linking out from GenBank: Every sequence record in NCBI databases with LinkOut capabilities that has an article as reference (REFERENCE line in GenBank format) for which there is a data package in Dryad should have a link out to the respective data package. Initially this may be restricted to the nucleotide databases (and, among these, to the NUCCORE database), which should nonetheless cover the most common use case. It may also need to be restricted to sequences for those articles that also have a DOI.

Linking out from LabsLink: Dryad also furnishes links to Europe PMC via the LabsLink service. They support a similar mechanism to NCBI LinkOut, where we provide a list of PMIDs and destination data packages.

Usage

See also: User-oriented description of NCBI LinkOut.

The DryadLinkoutTool is a standalone Java application that runs on the command line. It is not integrated into the DSpace infrastructure. It connects directly to the Dryad instance's PostgreSQL database, so it should be run on the database server (or on a host that can connect to the database). It can be built from the source on GitHub (using an Ant script in src/buildfiles) but requires a configuration file at runtime to connect to a Dryad database (see the Configuration section).

The Java application exposes two classes with a command-line interface: NCBILinkoutBuilder and LabsLinkLinksBuilder. The former synchronizes PMIDs into the Dryad database and generates LinkOut XML files. The latter simply generates LabsLink XML files.

These command-line interfaces are driven by shell scripts that automate the execution, validation, uploading, and error reporting of the process:

  1. generateLinkout.sh is run without arguments and builds the default PubMed and Sequence Database Resource Mapping XML files (pubmedlinkout.xml and sequencelinkout[1-n].xml). This should only be used for local testing.
  2. generateUploadLinkout.sh takes one argument (a file containing FTP credentials), generates the same files, validates them against the LinkOut DTD using xmllint, and uploads them to the NCBI FTP site using cURL. If any command fails, errors are emailed to linkout@datadryad.org.
  3. generateLabsLinkFiles.sh is run without arguments and builds the default LabsLink Mapping XML file (labslinklinkout.xml) and profile (labslinkprofile.xml).
  4. generateUploadLabsLink.sh takes one argument (a file containing FTP credentials), generates the same files as generateLabsLinkFiles.sh, validates them against the LabsLink DTD using xmllint, compresses them with gzip, and uploads them to the LabsLink FTP site using cURL. If any command fails, errors are emailed to linkout@datadryad.org.

Configuration

Configuring the command-line tool requires creating or editing the file connection.properties. A template for this file (template.properties) can be found in the src/config folder of the repository. Set the user, password, database, and host needed to connect to the Dryad database, and save the result as connection.properties. Running the Ant build (src/buildfiles/build.xml) will copy this file to the appropriate location (src/build/org/datadryad/interop).
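Before building, it can be useful to confirm that connection.properties is complete. The following Python sketch parses a Java-style properties file and reports missing settings. The key names (user, password, database, host) are assumed from the description above and may not match the names used in template.properties; the sample values are placeholders.

```python
# Assumed key names; check template.properties for the real ones.
REQUIRED_KEYS = ("user", "password", "database", "host")


def parse_properties(text):
    """Parse simple 'key=value' lines, skipping blank lines and comments."""
    props = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def missing_keys(props):
    """Return the required keys that are absent or empty, in a fixed order."""
    return [key for key in REQUIRED_KEYS if not props.get(key)]


# Placeholder content standing in for a real connection.properties file.
EXAMPLE = """
# hypothetical connection.properties
user=dryad_app
password=changeme
database=dryad_repo
host=localhost
"""

if __name__ == "__main__":
    print(missing_keys(parse_properties(EXAMPLE)))
```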

NCBI LinkOut Workflow

On the production server, the script generateUploadLinkout.sh runs weekly. It is installed in the crontab of the dryad user. This script regenerates the links and uploads them.

NCBI LinkOut requires at least two files:

  • providerinfo.xml -- identifies and gives information about the provider (Dryad in this case). This file is not generated by the LinkOut tool, since its contents will not change. See #Providerinfo.
  • resource mapping file (typically resources.xml, but we generate pubmedlinkout.xml and sequencelinkoutN.xml). These files provide the mapping between objects in NCBI databases and Dryad URLs. A resource mapping file can be in XML or tabular format. Objects in NCBI can be identified by one or more Entrez queries, or alternatively as a list of object IDs (such as PMIDs for articles, or GI numbers for sequences). Providers (such as Dryad) are free to use any name for the mapping file (resources is the default) and may provide more than one. Each file uploaded to the server must be smaller than 32MB. The sequencelinkout files are split by OtherTarget when they exceed 16MB or 100,000 objects. The generated pubmedlinkout.xml file is well under this limit.
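The object-count part of the splitting rule can be illustrated with a short Python sketch. This is a simplification and not the tool's actual Java implementation: it ignores the 16MB size check and the grouping by OtherTarget, and the function names are hypothetical.

```python
def split_into_chunks(object_ids, max_objects=100_000):
    """Split a list of NCBI object IDs into consecutive chunks of bounded size."""
    if max_objects < 1:
        raise ValueError("max_objects must be at least 1")
    return [
        object_ids[start:start + max_objects]
        for start in range(0, len(object_ids), max_objects)
    ]


def chunk_filenames(n_chunks, stem="sequencelinkout"):
    """Name the output files sequencelinkout1.xml, sequencelinkout2.xml, ..."""
    return [f"{stem}{index}.xml" for index in range(1, n_chunks + 1)]


if __name__ == "__main__":
    chunks = split_into_chunks(list(range(250_000)))
    for name, chunk in zip(chunk_filenames(len(chunks)), chunks):
        print(name, len(chunk))
```

A size-based check would have to be applied after serialization, since the byte size of a file depends on the generated XML rather than on the ID count alone.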

When the LinkOut tool is run: The LinkOut tool connects to a Dryad database and to NCBI web services. It generates multiple resource mapping files (one pubmedlinkout.xml and multiple sequencelinkoutN.xml). Currently the sequence linkouts are spread over 5 files. The LinkOut tool also adds PMIDs (dc:relation.isreferencedby) to data packages where the NCBI web service provides a PMID but the Dryad database does not contain one.

Resource mapping files should be validated against the NCBI-supplied DTD prior to upload. NCBI provides an online LinkOut File Validation Utility, but this tool has a 5MB file upload limit. The validation can also be performed locally using xmllint, which is part of libxml2. xmllint has no file-size limit and is available or preinstalled on most operating systems. Validation is performed by running

xmllint --dtdvalid http://www.ncbi.nlm.nih.gov/projects/linkout/doc/LinkOut.dtd --noout file_to_validate.xml

If the command returns no output, the validation is successful. More information on the LinkOut format is available in NCBI's LinkOut documentation.
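When validation is scripted, checking the exit status is more robust than checking for empty output. The Python sketch below wraps the xmllint command shown above. It assumes xmllint is on the PATH and relies on xmllint exiting with a nonzero status when validation fails; the function names are illustrative.

```python
import subprocess

LINKOUT_DTD = "http://www.ncbi.nlm.nih.gov/projects/linkout/doc/LinkOut.dtd"


def xmllint_command(xml_path, dtd=LINKOUT_DTD):
    """Build the argument list for validating one file against a DTD."""
    return ["xmllint", "--dtdvalid", dtd, "--noout", xml_path]


def validates(xml_path, dtd=LINKOUT_DTD):
    """Run xmllint and report success based on its exit status."""
    result = subprocess.run(
        xmllint_command(xml_path, dtd),
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        # xmllint writes its diagnostics to stderr.
        print(result.stderr)
    return result.returncode == 0
```

For example, validates("pubmedlinkout.xml") could be called for each generated file before upload.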

The XML files (identity and resource files) should then be uploaded to the NCBI FTP server. Dryad has been assigned an account and deposit directory on this server. The credentials are stored on the server in a cURL config file. Note that the server does not use Unix-style commands for operations such as file deletion. Once files have been deposited, NCBI will process updates daily. When these files are processed, a message (either an acknowledgement or an error report) will be sent to the designated contacts (currently Hilmar and linkout@datadryad.org). Links will appear in PubMed and other NCBI resource pages as a result of this processing. More details on the transfer and reporting process are provided at http://www.ncbi.nlm.nih.gov/books/NBK3802/#nonbib.File_Transfer.

NCBI can provide hit statistics, on a monthly basis, for our links on request to linkout@ncbi.nlm.nih.gov.

When a user clicks on one of our links: They will be sent to the appropriate data package page.

Providerinfo

A prototype providerinfo.xml is in the Dryad GitHub repository. Note that this is not in the DryadLinkoutTool repository.

The SubjectType and IconUrl elements can be given either in the providerinfo.xml or in the resource mapping, but not in both. Our implementation puts them into the resource mapping, see below.

LinkOut from PubMed

Resource mapping: PubMed can be queried through Entrez by DOI, like so: http://www.ncbi.nlm.nih.gov/pubmed?term=10.1111/j.1469-7580.2009.01108.x[doi]

However, it turns out that contrary to the documentation at NCBI, DOI is not permissible in a LinkOut Entrez query. Thus, the PubMed IDs corresponding to article DOIs in Dryad (i.e., for which Dryad has data packages) need to be harvested in advance, and then subsequently stored (ideally as part of the Dryad metadata) so that the resolution process doesn't have to be repeated with each update. Harvesting can be done, for example, by looking up PMIDs for article DOIs at NCBI's EUtils:
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=10.1111/j.1469-7580.2009.01108.x[doi]
This returns the results as an XML document (with each PMID enclosed in <Id> tags).
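A minimal Python sketch of this lookup follows. It builds the ESearch URL for a DOI and extracts PMIDs from the returned XML. The function names are illustrative, the sample response is abbreviated, and the PMID in it is a made-up placeholder; the network call in lookup_pmids is not exercised here.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

ESEARCH = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(doi):
    """Build an ESearch URL that looks up a PubMed record by DOI."""
    query = urllib.parse.urlencode({"db": "pubmed", "term": f"{doi}[doi]"})
    return f"{ESEARCH}?{query}"


def parse_pmids(xml_text):
    """Extract the PMIDs listed under IdList in an ESearch result."""
    root = ET.fromstring(xml_text)
    return [element.text for element in root.findall("./IdList/Id")]


def lookup_pmids(doi):
    """Fetch and parse the ESearch result for one DOI (requires network access)."""
    with urllib.request.urlopen(esearch_url(doi)) as response:
        return parse_pmids(response.read())


# Abbreviated response shape; the PMID value is a placeholder, not real data.
SAMPLE_RESPONSE = (
    "<eSearchResult><Count>1</Count>"
    "<IdList><Id>12345678</Id></IdList>"
    "</eSearchResult>"
)

if __name__ == "__main__":
    print(esearch_url("10.1111/j.1469-7580.2009.01108.x"))
    print(parse_pmids(SAMPLE_RESPONSE))
```

An empty list signals that no PubMed record matched the DOI, in which case nothing should be stored for that article.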

For articles that do not have a DOI (but have data in Dryad) we can use NCBI's "citation matcher" (also for batch querying) to search PubMed and connect to the Dryad package (once we have the PMID). On the other hand, this can also be accomplished directly through Entrez, for example using journal, volume, issue, and start page for the same article as above.

A prototype resource mapping in XML format is in the Dryad subversion repository. It uses a special version of the Dryad logo that has a button-like appearance (which is what NCBI requests), and assigns the subject type of "supplemental material" to Dryad.

LinkOut from GenBank

Resource mapping: GenBank records don't include reference article DOIs as part of their metadata, so they can't be queried by DOI. Unfortunately, Entrez also doesn't support querying NCBI's nucleotide or protein databases by PubMed ID. That means that Entrez query specifications cannot be used directly for identifying the NCBI objects from which to link out.

Therefore, each object (sequence record) needs to be identified by its ID (i.e., GI number), and the mapping must instead consist of an enumeration of all such IDs. The mapping can be provided in an XML file or in a text format. The XML format is the one of choice if the link back to Dryad can be expressed as a pattern that applies to many (or all) of the NCBI records to be mapped. Otherwise, if the mapping is a direct mapping from NCBI ID to Dryad ID (by dataset URI or DOI), the text format is possibly more straightforward to produce, although in the end the same tokens of information need to be provided.

Due to the lack of an appropriate Entrez query, the NCBI IDs that map to publications need to be obtained using the ELink query service (see also programming examples), which is part of NCBI's EUtils. Given a target database (such as nuccore, the core nucleotide database), a source database (such as PubMed), and one or more NCBI IDs of source records (such as PMIDs), this service returns all NCBI IDs in the target database that are linked to the source records. The query returns the results as an XML document. For example, to query for the sequences in the nuccore database associated with PubMed ID 21166729 (corresponding Dryad DOI):
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pubmed&db=nuccore&id=21166729
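The linked IDs can be pulled out of the ELink XML with a few lines of Python. The sketch below assumes the usual ELink result layout (LinkSetDb elements containing DbTo and Link/Id children). The sample document is abbreviated and its second ID is a made-up placeholder.

```python
import xml.etree.ElementTree as ET


def parse_elink_ids(xml_text, target_db):
    """Collect the IDs that an ELink result lists for one target database."""
    root = ET.fromstring(xml_text)
    ids = []
    for link_set_db in root.iter("LinkSetDb"):
        if link_set_db.findtext("DbTo") == target_db:
            ids.extend(link.findtext("Id") for link in link_set_db.findall("Link"))
    # The same ID may appear under several link names; keep the first occurrence.
    return list(dict.fromkeys(ids))


# Abbreviated response shape; 316925971 is a placeholder, not real data.
SAMPLE_RESPONSE = """<eLinkResult>
  <LinkSet>
    <DbFrom>pubmed</DbFrom>
    <IdList><Id>21166729</Id></IdList>
    <LinkSetDb>
      <DbTo>nuccore</DbTo>
      <LinkName>pubmed_nuccore</LinkName>
      <Link><Id>316925969</Id></Link>
      <Link><Id>316925971</Id></Link>
    </LinkSetDb>
  </LinkSet>
</eLinkResult>"""

if __name__ == "__main__":
    print(parse_elink_ids(SAMPLE_RESPONSE, "nuccore"))
```

The source PMID in the top-level IdList is deliberately ignored, since only the IDs inside LinkSetDb belong to the target database.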

As a note, when useful the result can also be obtained within NCBI's (HTML) web-interface:
http://www.ncbi.nlm.nih.gov/sites/entrez?cmd=link&db=nucleotide&dbFrom=pubmed&from_uid=21166729
However, while it is possible to change the output of this to a plain text list of GI numbers (parameter report=gilist), the output is truncated to at most 200 IDs (controlled by parameter dispmax, which is ignored for values greater than 200). Furthermore, the dispmax and report parameters are undocumented for this form of the query.

If we wish to use accession numbers instead of raw database IDs, these can be retrieved via efetch: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=nuccore&id=316925969&rettype=acc where the id comes from the list returned by elink. This returns a plain-text file containing versioned accession numbers (e.g., HQ413648.1).
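Handling the plain-text efetch output is straightforward. The following Python sketch builds the efetch URL and splits the response into accession numbers; the function names are illustrative, and the version-stripping helper is only needed if unversioned accessions are wanted.

```python
import urllib.parse

EFETCH = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_acc_url(ncbi_id, db="nuccore"):
    """Build an efetch URL that returns the accession for one NCBI ID."""
    query = urllib.parse.urlencode({"db": db, "id": ncbi_id, "rettype": "acc"})
    return f"{EFETCH}?{query}"


def parse_accessions(text):
    """Split the plain-text response into one accession per non-empty line."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def strip_version(accession):
    """Drop the version suffix, e.g. HQ413648.1 becomes HQ413648."""
    return accession.split(".", 1)[0]


if __name__ == "__main__":
    print(efetch_acc_url("316925969"))
    print([strip_version(acc) for acc in parse_accessions("HQ413648.1\n")])
```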

To script generation of the mapping:

  1. Obtain a table of all article DOIs and their corresponding data package DOIs in Dryad.
  2. For each article DOI:
     2.1 Obtain the corresponding PubMed ID from NCBI Entrez.
     2.2 For each NCBI sequence database with LinkOut as target:
         2.2.1 Using the PubMed ID, use the ELink web service to obtain the corresponding NCBI IDs for the target.
         2.2.2 For each NCBI ID: write a mapping from the NCBI ID to the Dryad data package DOI.
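The steps above can be sketched in Python as a pair of nested loops. This is an outline under stated assumptions, not the DryadLinkoutTool implementation: the two lookup functions are passed in as parameters so that the loop structure can be shown without network access, and all DOIs and IDs in the demonstration are placeholders.

```python
def build_mapping(article_to_package, doi_to_pmid, linked_ids, target_dbs=("nuccore",)):
    """Return (database, NCBI ID, Dryad package DOI) triples.

    article_to_package: dict of article DOI -> Dryad data package DOI (step 1)
    doi_to_pmid: callable returning a PMID for an article DOI, or None (step 2.1)
    linked_ids: callable returning NCBI IDs for a (PMID, database) pair (step 2.2.1)
    """
    mapping = []
    for article_doi, package_doi in article_to_package.items():
        pmid = doi_to_pmid(article_doi)
        if pmid is None:
            continue  # article not found in PubMed; nothing to link
        for db in target_dbs:
            for ncbi_id in linked_ids(pmid, db):
                mapping.append((db, ncbi_id, package_doi))  # step 2.2.2
    return mapping


if __name__ == "__main__":
    # Placeholder data standing in for the Dryad table and the NCBI services.
    packages = {
        "10.1000/example.1": "doi:10.5061/dryad.aaaa",
        "10.1000/example.2": "doi:10.5061/dryad.bbbb",
    }
    pmids = {"10.1000/example.1": "111"}
    links = {("111", "nuccore"): ["9001", "9002"]}
    result = build_mapping(
        packages,
        pmids.get,
        lambda pmid, db: links.get((pmid, db), []),
        target_dbs=("nuccore", "protein"),
    )
    for row in result:
        print(row)
```

Passing the lookups in as functions also makes it easy to substitute cached PMIDs from the Dryad metadata for repeated calls to NCBI.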

Loop 2.2 is needed because sequence data associated with a PubMed article are not necessarily found in the core nucleotide (nuccore) database, so other databases need to be tried as well, provided they support LinkOut. For example, the deposited sequence data for PubMed ID 21054605 (corresponding Dryad DOI) are in the Short Read Archive (SRA) rather than nuccore:
http://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=pubmed&db=sra&id=21054605 (same in HTML form)
The SRA database is currently not among those for which LinkOut is supported (but, according to the documentation, such support could be requested).

Dryad metadata augmentation: The algorithm above for generating the mapping between NCBI sequence database records and Dryad data packages could also be used to augment the metadata in Dryad, by adding the resulting PubMed IDs as article identifiers and the resulting sequence IDs (possibly after translating them to accession numbers using Entrez) as external database identifiers. If these were subsequently indexed, the mapping could be based on a URL pattern for the Dryad target, using the &lo.id; and &lo.pacc; entities in rules.

LabsLink LinkOut Workflow

On the production server, the script generateUploadLabsLink.sh runs weekly. It is installed in the crontab of the dryad user. This script regenerates the links and uploads them.

LabsLink can link articles by PMID, so we furnish links as PMIDs that map back to Dryad data packages. Since the PMIDs are added to Dryad data packages as part of the NCBI LinkOut process, the LabsLink process depends on the LinkOut process being current.

LabsLink requires two files:

  • links file (labslinklinkout.xml)
  • profile file (labslinkprofile.xml)

LabsLink links file

This file is analogous to the pubmedlinkout.xml file used for NCBI. It is generated by LabsLinkLinksTarget, and simply contains an entry for each data package, its URL, and the related PMID(s).

LabsLink profile file

This file is analogous to the providerinfo.xml file used for NCBI. It is generated by LabsLinkProfileTarget, and simply contains information about Dryad as a link provider.

Relation to DSpace

Apart from queries and updates to the PostgreSQL database that implements the DSpace storage layer, the generator does not modify the operation or code of DSpace. A few modifications were made so that a search query that returns a single package as a result goes directly to the package page rather than listing it as a search result. This may change if the tool is modified to use the DSpace curation tools API rather than querying the database directly.