Automated Publication Updater

The automated publication updater (APU) is a webapp that a curator can run to validate items in the workflow (in review, in curation, or in publication blackout) and in the archive, and to notify curators when new publication data is available from CrossRef.

To run the webapp, you need an authorization password (the APU token) set up in your server's Maven settings, under `<default.publication.updater.token>`.

Once you know the token, open http://YOUR_DRYAD_SERVER/publication-updater/retrieve?auth=APU_TOKEN&user=SOME_USER_ID, where APU_TOKEN is that token and SOME_USER_ID can be any value unique to you. The updater sends its emails to dryadassistant@datadryad.org, which is a Google Group. Curators can subscribe to that group to see the results.

Query parameters

There are several possible query parameters that can be added to the URL, like so: http://YOUR_DRYAD_SERVER/publication-updater/retrieve?PARAM=VALUE&PARAM2=VALUE2

  • auth: REQUIRED. The value for this parameter should be the APU token corresponding to that particular server. Ask a dev for the value for a particular server. This is to prevent bots from picking up the URL and running the APU without our knowledge.
  • user: REQUIRED. This should be some value that specifies who you are, so that devs can see who ran the APU in the log files. It doesn't matter what the value is, but it needs to be present.
  • issn: Adding an ISSN will run the updater only on that journal.
  • start: This parameter specifies what letter to start from in the alphabetical journal list. Helpful if the APU has stopped for any reason and one wants to resume somewhere in the middle.
  • item: Adding a specific internal item ID will run the updater only on that item. Useful for debugging. No emails are sent.
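
As an illustration of how these parameters combine, the following Python sketch assembles a retrieve URL. The helper name `build_apu_url` and the example ISSN are placeholders invented for this example; they are not part of the APU itself.

```python
from urllib.parse import urlencode


def build_apu_url(server, auth, user, issn=None, start=None, item=None):
    """Assemble a publication-updater URL (hypothetical helper, not part of the APU)."""
    if not auth or not user:
        raise ValueError("auth and user are both required")
    params = {"auth": auth, "user": user}
    # Optional parameters are added only when a value is supplied.
    for name, value in (("issn", issn), ("start", start), ("item", item)):
        if value is not None:
            params[name] = value
    return f"http://{server}/publication-updater/retrieve?{urlencode(params)}"


# Placeholder values throughout; substitute your own server, token, and user ID.
print(build_apu_url("YOUR_DRYAD_SERVER", "APU_TOKEN", "SOME_USER_ID", issn="1234-5678"))
```

Using `urlencode` rather than string concatenation means that a user ID containing spaces or other special characters is escaped correctly.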

Workflow

The updater iterates through each journal that has a journal concept with an ISSN. It first looks through the workflowItems that are packages associated with that particular journal. For each of those items, it does two things:

  • Compares the item to the journal metadata database: it makes sure that the item is up to date with the latest metadata we received from the journal.
  • If the journal provided a publication DOI, it updates the item's dc.relation.isreferencedby field.

Next, the updater checks CrossRef for publication information. If it finds information matching either the provided publication DOI or a high-score match on the authors and title, it updates the item's metadata:

  • publication DOI (dc.relation.isreferencedby)
  • citation (dc.identifier.citation), created if missing: if there is no journal volume, the publication is “online in advance of print.”
  • dryad.citationInProgress is set to true for all updated records; a curator should manually remove this flag once the citation is verified and checked off in the spreadsheet.
  • publication date (dc.date.issued)
  • any matching journal metadata record is updated to status "published."
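
The field updates listed above can be sketched in Python as follows. This is only an illustration under stated assumptions: items and matches are modeled as plain dicts, the keys of `match` (`doi`, `citation`, `volume`, `issued`) are hypothetical, and the exact wording the real updater uses for advance-of-print citations is not given on this page.

```python
def apply_crossref_match(item, match):
    """Copy publication details from a CrossRef match into an item's metadata.

    Both arguments are plain dicts; the keys of `match` are hypothetical.
    """
    item["dc.relation.isreferencedby"] = match["doi"]
    citation = match["citation"]
    if not match.get("volume"):
        # No journal volume yet: the publication is online in advance of print.
        # The wording appended here is an assumption.
        citation += " Online in advance of print."
    item["dc.identifier.citation"] = citation
    item["dc.date.issued"] = match["issued"]
    # Left in place until a curator verifies the citation and removes it.
    item["dryad.citationInProgress"] = "true"
    return item
```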

CrossRef looks for items using a Solr query and returns one of two kinds of relevancy score: if CrossRef has only one match, it returns a score of exactly 1.0; if it finds multiple matches, each matched item has a floating-point relevancy score (higher is better). Currently we treat a match score of 2.0 or higher as an accurate match.
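
A minimal Python sketch of this acceptance rule is shown below. The function name `pick_match` and the candidate keys `doi` and `score` are hypothetical. The page does not say how the updater treats a lone match with a score of exactly 1.0; this sketch simply applies the 2.0 threshold to it, which is an assumption.

```python
MATCH_THRESHOLD = 2.0  # from the text: scores of 2.0 or higher count as accurate


def pick_match(candidates, provided_doi=None, threshold=MATCH_THRESHOLD):
    """Return the accepted candidate, or None if nothing qualifies.

    `candidates` is a list of dicts with hypothetical keys "doi" and "score".
    """
    # A candidate whose DOI equals the journal-provided DOI is accepted outright.
    if provided_doi is not None:
        for candidate in candidates:
            if candidate["doi"] == provided_doi:
                return candidate
    # Otherwise take the best-scoring candidate, if it clears the threshold.
    best = max(candidates, key=lambda c: c["score"], default=None)
    if best is not None and best["score"] >= threshold:
        return best
    return None
```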

For all updated items, the updater then emails dryadassistant@datadryad.org with a list of the updated Dryad DOIs.

For each journal, the updater also finds items in the archive that need updated citations: these items either have no dc.identifier.citation or have dryad.citationInProgress set (to true, or to any value at all). It queries CrossRef as before. If anything matches, it updates the item metadata in the same way as for workflow items. It then emails dryadassistant@datadryad.org with a list of the updated Dryad DOIs.
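
The selection rule for archived items can be sketched as a small Python predicate. Items are modeled as plain dicts keyed by metadata field name, which is an assumption made for illustration; `needs_citation_update` is a hypothetical name.

```python
def needs_citation_update(item):
    """True if an archived item lacks a citation or is flagged as in progress."""
    has_citation = bool(item.get("dc.identifier.citation"))
    # Any value for the flag counts, not only "true".
    in_progress = "dryad.citationInProgress" in item
    return (not has_citation) or in_progress
```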