Update of Publication Metadata

Publication metadata in Dryad needs to be updated based on an authoritative feed of metadata. To avoid overwriting metadata in Dryad that is already correct, the process will be overseen by a curator.

Proposed workflow

 * 1) A deposit is archived in Dryad.
 * 2) The associated article is published.
 * 3) An automated system (Dryad curation task?) will run nightly to check for newly published articles. When the system identifies a new article matching data in Dryad, the data will be processed according to its current state:
    * If the data is in the archive or the workflow system, the Dryad curators will receive a notification email.
    * If the data is in publication blackout, the data will be archived, and the Dryad curators will receive a notification email.
    * If the data is in the review stage, the journal editor will receive a notification email. This is an error state, since the journal should have notified us of the article's status before the article was published.
 * 4) The curator runs a tool (based on HAMR) to compare the Dryad metadata with the authoritative metadata and selects the fields in Dryad to be updated.
 * 5) The status field in Dryad is updated to "metadata verified".
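
The state-dependent handling in the workflow above could be sketched roughly as follows. This is only an illustration: the `State` and `Package` names and the recipient labels are hypothetical and do not correspond to existing Dryad code.

```python
from dataclasses import dataclass
from enum import Enum


class State(Enum):
    # Hypothetical names for the states mentioned in the workflow.
    ARCHIVE = "archive"
    WORKFLOW = "workflow"
    BLACKOUT = "publication blackout"
    REVIEW = "review"


@dataclass
class Package:
    identifier: str
    state: State


def process_match(package):
    """Return (new_state, email_recipient) for a package whose article was just published."""
    if package.state in (State.ARCHIVE, State.WORKFLOW):
        return package.state, "curators"
    if package.state is State.BLACKOUT:
        # Publication ends the blackout, so the data is archived.
        return State.ARCHIVE, "curators"
    if package.state is State.REVIEW:
        # Error state: the journal should have notified us before publication.
        return package.state, "journal editor"
    raise ValueError(f"unexpected state: {package.state}")
```

The function only decides what should happen; sending the email and changing the state in Dryad would be done by the nightly task that calls it.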

Other pieces to fit into the workflow:
 * 1) Automated search of CrossRef for records in which the DOI has been assigned but we do not yet have page numbers, automatically completing records that have a match (possibly no need for curator approval here).
 * 2) Generating a report, at least weekly, of records without CrossRef matches, sorted from oldest to newest.
 * 3) A simple way for the curator to flag and escalate records that have been waiting an inordinate amount of time for a publication to appear (say 3 months?). For these, a process should be triggered in which there is some investigation and possibly communication with the author or journal (exact process TBD).
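
As a rough sketch of the weekly report and the escalation flag described above: the record layout (identifier, deposit date, match flag) is an assumption made for illustration, and "3 months" is taken to mean 90 days.

```python
from datetime import date, timedelta

# Assumption: the "3 months" suggested above is approximated as 90 days.
ESCALATION_AGE = timedelta(days=90)


def unmatched_report(records, today):
    """Build report rows for records that have no CrossRef match.

    records: iterable of (identifier, deposit_date, has_crossref_match).
    Returns (identifier, deposit_date, needs_escalation) tuples, oldest first.
    """
    waiting = [(ident, deposited) for ident, deposited, matched in records if not matched]
    waiting.sort(key=lambda item: item[1])  # oldest deposit first
    return [
        (ident, deposited, today - deposited > ESCALATION_AGE)
        for ident, deposited in waiting
    ]
```

Rows with `needs_escalation` set would be the ones the curator is prompted to investigate.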

CrossRef API
We will use the CrossRef API whenever possible. We expect it to be simpler than harvesting content from individual journals, and with broader coverage than PubMed.

http://api.labs.crossref.org/ + identifier + ".xml"
 * Full CrossRef documentation
 * Summary page for CrossRef APIs
 * HTTP query allows a simple "search", but most metadata must already be known to retrieve the item.
 * OpenURL query allows more sophisticated input -- it is unclear how much info is needed to get accurate responses
 * OAI-PMH may be comparable to PubMed
 * CrossRef Labs simple documentation
 * CrossRef Labs simple search
 * Sample simple search that returns XML:

Most items should be findable using the OpenURL or SIGG search. However, if these do not generate matches, we can perform an OAI-PMH lookup for the relevant timespan, and provide the best match possible from the resultant list.
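
The lookup order described above might be sketched as below. The URL pattern is the one given in this section; the search functions are hypothetical placeholders passed in as arguments, since no particular CrossRef client is assumed here.

```python
LABS_BASE = "http://api.labs.crossref.org/"


def labs_url(identifier):
    """Build the CrossRef Labs URL for an identifier, following the pattern above."""
    return LABS_BASE + identifier + ".xml"


def find_article(item, searches, oai_pmh_candidates, best_match):
    """Try each direct search in order, then fall back to an OAI-PMH lookup.

    searches: callables that take the Dryad item and return a record or None.
    oai_pmh_candidates: callable returning a list of records for the relevant timespan.
    best_match: callable choosing the best record from that list.
    All three are hypothetical hooks, not existing functions.
    """
    for search in searches:
        result = search(item)
        if result is not None:
            return result
    candidates = oai_pmh_candidates(item)
    return best_match(item, candidates) if candidates else None
```

Keeping the searches as interchangeable callables means the order can be changed once it is clearer which query gives the most accurate responses.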

PubMed
PubMed should be queried to locate any available PubMed ID. This ID should be added to the data package.

If there are problems with the implementation based on CrossRef, we will use PubMed, since this contains metadata in a standardized format for a broad range of journals. For most journals, metadata appears in PubMed within a few days of the article being published.
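
A PubMed ID lookup could be sketched with the NCBI E-utilities `esearch` service, as below. Searching by title and journal name is an assumption about which fields will be available; the helper names are illustrative, and the network request itself is left out.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def pubmed_search_url(title, journal):
    """Build an esearch URL that restricts the query to title and journal fields."""
    term = f"{title}[Title] AND {journal}[Journal]"
    return ESEARCH + "?" + urlencode({"db": "pubmed", "term": term})


def pubmed_ids(esearch_xml):
    """Extract the PubMed IDs from an esearch XML response."""
    root = ET.fromstring(esearch_xml)
    return [node.text for node in root.findall("./IdList/Id")]
```

If exactly one ID comes back, it can be added to the data package; zero or several IDs would presumably need a curator's attention.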

Direct journal monitoring
If we are unable to track article publication via CrossRef or via PubMed, we may choose to follow feeds from individual journals. This is undesirable, because it will not scale as well.

For Pensoft journals, we can use the Pensoft OAI-PMH feed, which contains full publication information in Dublin Core or MODS format.

Open questions

 * 1) What criteria should be used to determine a "match" for purposes of notifying the curator? Although in the future we should be able to use the data DOI to perform the match, initially we will need to rely on less precise information such as article title, authors, and journal name.
 * 2) For articles that have been published before the Dryad data deposit, should the metadata update be tied in to the curation process? Presumably, the submitter would have already imported the article metadata using a DOI.
 * 3) Is it more efficient to query for each item that is currently incomplete in Dryad, or to monitor newly published items and match them to the incomplete items in Dryad?
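
To make question 1 concrete, one possible matching rule is sketched below: same journal, at least one shared author, and a closely similar title. The dictionary keys and the 0.9 threshold are assumptions for discussion, not agreed criteria.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def similarity(a, b):
    """Similarity ratio between 0 and 1 for two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def is_match(dryad, article, title_threshold=0.9):
    """Decide whether a published article matches a Dryad record.

    Both arguments are dicts with "title", "journal" and "authors" keys
    (a hypothetical layout chosen for this sketch).
    """
    same_journal = normalize(dryad["journal"]) == normalize(article["journal"])
    dryad_authors = {normalize(name) for name in dryad["authors"]}
    article_authors = {normalize(name) for name in article["authors"]}
    shared_author = bool(dryad_authors & article_authors)
    close_title = similarity(dryad["title"], article["title"]) >= title_threshold
    return same_journal and shared_author and close_title
```

Because the curator reviews every notification, a rule like this can afford to be somewhat permissive; the threshold would need tuning against real records.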

Next steps

 * 1) Consult with Dryad curator to answer the questions above.
 * 2) Flesh out the design of emails to be sent and webpages that curators will use.
 * 3) Implement the lookup of items in CrossRef and the associated email notification.
 * 4) Extend HAMR to retrieve and display metadata from CrossRef.

Related Pages
Related Fogbugz tickets:
 * https://nescent.fogbugz.com/default.asp?1443 -- main placeholder ticket for this feature

Wireframes
Update of Publication Metadata Wireframes