Journal Metadata Processing Technology

From Dryad wiki

Overview

The Dryad journal-submit webapp imports metadata that integrated journals email to Dryad. Each email is parsed for article publication metadata, and that metadata is used to populate the form that the submitter sees on arriving at Dryad, either by following the journal-provided link or by manually entering a manuscript number that Dryad has metadata for.

Journal Email Workflow

  1. Journal sends email with article metadata to the journal-submit@datadryad.org email address, which forwards to journal-submit-app@datadryad.org
  2. Messages in journal-submit-app@datadryad.org are automatically labeled by Gmail with a special label.
  3. At a regular interval, the journal-submit webapp retrieves the newest emails with the special label, saves those messages for further processing, and removes the label. The specific label is set in the Maven settings on the server, so each Dryad server can react to a different label.
  4. The webapp processes the new emails: it gets the byte stream with the email content and detects the journal's name.
  5. The webapp then looks up the journal name in the journal authority control, to learn the `parsingScheme` to use. The value of this attribute is used to match on the parsing classes in the journal-submit's codebase.
  6. When a particular parser is determined to be the correct one to use, the webapp uses that to parse the remainder of the email, putting metadata values into a ParsingResult object.
  7. The ParsingResult class is then used to export the metadata into an XML file; the location of this file is determined by the `metadataDir` value from the journal authority control.
  8. Once this file is written (using the manuscript number as the name of the file), the journal-submit webapp is done with its process and control moves to the standard DSpace/Dryad submission process, which uses the written XML file to populate the submission form when the author comes to Dryad to complete the submission process.
  9. Later, a submitter will use the Submission System to retrieve the parsed metadata and initiate a new data submission.
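
Steps 4 through 8 can be pictured with the following minimal Python sketch. It is not the webapp's actual (Java) code: the `journals` dictionary stands in for the journal authority control, the "Field Name: value" email layout is an assumption, and the root element name and the function names are hypothetical.

```python
from pathlib import Path
from xml.etree import ElementTree as ET


def parse_fields(body):
    """Split 'Field Name: value' lines into a dict (assumed email layout)."""
    fields = {}
    for line in body.splitlines():
        name, sep, value = line.partition(":")
        if sep and value.strip():
            fields[name.strip()] = value.strip()
    return fields


def process_email(body, journals):
    """Detect the journal, look up its concept, and write the XML file."""
    fields = parse_fields(body)
    # Hypothetical stand-in for the authority control lookup (step 5).
    concept = journals[fields["Journal Name"]]
    # The real webapp would pick a parser class from concept["parsingScheme"];
    # this sketch only has the one generic parser above (step 6).
    root = ET.Element("DryadEmailSubmission")  # hypothetical root element name
    for name, value in fields.items():
        ET.SubElement(root, name.replace(" ", "_")).text = value
    # Steps 7-8: the file lives in metadataDir and is named after the manuscript number.
    out_path = Path(concept["metadataDir"]) / f'{fields["MS Reference Number"]}.xml'
    ET.ElementTree(root).write(str(out_path), encoding="utf-8", xml_declaration=True)
    return out_path
```

Calling `process_email(body, journals)` with a `journals` entry whose `metadataDir` points at an existing directory returns the path of the written file, for example `.../EXJ-D-14-00123.xml`.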

If an email fails to parse because it is malformed (see Journal Metadata for proper email formatting), it is tagged with the server's Gmail error label.

Users with access to the journal-submit-app Gmail account can also manually add and remove the Gmail labels to re-process specific emails. To trigger the journal-submit webapp manually, request the URL http://whatever.server/journal-submit/retrieve.

Configuration

Overview

There are two components to the workflow: the journal-submit webapp, which runs on the server, and the Gmail account that the webapp listens to. Both can be tweaked separately.

Configuring the journal-submit webapp

Several settings need to be in the server's Maven settings.xml file. The following are example settings for the main production server. Please choose unique settings for your own server's label and error label, e.g. dev-journal-submit and dev-journal-submit-error for the dev server.

The JSON data for the clientsecret can be found in the Maven settings on dev or production, or can be obtained directly from Google via https://console.developers.google.com/project/journal-submit/apiui/credential and is the downloaded JSON for "Client ID for native application." 


<!-- Journal-submit webapp -->
<!-- the gmail account to listen to: -->
<default.submit.journal.email>journal-submit-app@datadryad.org</default.submit.journal.email>

<!-- the gmail label to look for: -->
<default.submit.journal.label>journal-submit</default.submit.journal.label>

<!-- the gmail label applied to messages that fail to parse properly: -->
<default.submit.journal.error.label>journal-submit-error</default.submit.journal.error.label>

<!-- how frequently the webapp should retrieve messages from gmail, in seconds: -->
<default.submit.journal.email.timer>600</default.submit.journal.email.timer>

<!-- clientsecret for authorization of the gmail account. -->
<!-- The JSON can be obtained at https://console.developers.google.com/project/journal-submit/apiui/credential -->
<!-- and is the downloaded JSON for "Client ID for native application." -->
<default.submit.journal.clientsecret>{JSON GOES HERE}</default.submit.journal.clientsecret>

Once the settings have been updated, rebuild and deploy your server.

Authorizing the webapp

This should only need to be done when a server is deploying the webapp for the very first time: the credentials should remain authorized unless and until someone revokes the access through the Google Developer Console.

Once the webapp is configured and built, start the Tomcat instance and authorize the webapp to access the Gmail account:

Go to a web browser and navigate to http://localhost:9999/journal-submit/authorize (or whatever address the Tomcat server is running at) and follow the OAuth2 instructions.

When you are provided with an auth code, copy it and make a call to http://localhost:9999/journal-submit/authorize?code=whateverthecodeis to authorize the webapp.

Restart Tomcat and request http://localhost:9999/journal-submit/test; a test message should appear in your journal-submit.log file.

Configuring the Gmail labels

The labels that the webapp looks for must be created in the Gmail settings. Log into the Gmail web site as the journal-submit-app@datadryad.org account. Create labels for the server you're configuring that match the <default.submit.journal.label> and <default.submit.journal.error.label> settings in the Maven settings file (as above). If you want to listen for all incoming emails, set a filter that applies the label to all incoming messages. The webapp will remove that label as it processes the labeled emails. Do not reuse an existing label for a new server instance, or else the new server's webapp will remove some other server's labels!

Configuring the journals

Journals are configured via journal concepts. Concepts can be created/updated from the Dryad website (Profile → Manage Journal Settings) or from the command line by updating JSON files and using CURL to interact with a REST API to post the files. For more information about how to create/update concepts, see Journal Concepts.

A sample configuration from the journal authority control is shown below, along with descriptions of the fields.

| Field Name | Sample Value | Description |
|---|---|---|
| dc.description.provenance | | Manually updated to track changes to the concept. |
| journal.journalID | EXJ | Provided by the journal in the PIQ |
| journal.metadataDir | /opt/dryad/submission/journalMetadata/EXJ | Constructed using the directory '/opt/dryad/submission' plus the journal ID |
| journal.parsingScheme | manuscriptCentral | Provided by the journal in the PIQ |
| journal.integrated | true | Set to false until integration goes live, then set to true |
| journal.allowReviewWorkflow | false | Provided by the journal in the PIQ |
| journal.embargoAllowed | true | Provided by the journal in the PIQ |
| journal.publicationBlackout | true | Provided by the journal in the PIQ |
| journal.sponsorName | The Example Journal Society of America | Provided by the journal in the PIQ |
| journal.notifyOnReview | someone@somewhere.com, automated-messages@datadryad.org | Provided by the journal in the PIQ |
| journal.notifyOnArchive | someone-else@somewhere.com, automated-messages@datadryad.org | Provided by the journal in the PIQ |
| journal.notifyWeekly | and-yet-another@somewhere.com, automated-messages@datadryad.org | Provided by the journal in the PIQ |
| canonicalManuscriptNumberPattern | .*?(EXJ-D-\d+-\d+).*? | This is a regular expression (regex) that defines the search pattern used by the web app to parse the MS Reference ID field in journal notifications. Typically, the MS Reference ID contains a group of alphabetic characters followed by a group (or groups) of numeric characters. Sometimes the groups of characters are separated by a dash or a slash. Some journals add version indicators (for example, .R1, .R2 or A, B, C) at the end. A few journals add a user id at the beginning of the MS Reference ID for security reasons. A helpful tool for developing and testing the regular expression can be found at https://regex101.com/. |
| journal.issn | 1234-5678 | Provided by the journal in the PIQ |
| journal.coverImage | /themes/Dryad/images/coverimages/EXJ.png | Request from journal during integration set up |
| journal.hasJournalPage | true | Should be set to true for all integrated journals |
| organization.fullName | Example Journal | Provided by the journal in the PIQ |
| organization.paymentPlanType | SUBSCRIPTION | DEFERRED or SUBSCRIPTION or PREPAID |
| organization.customerID | 0000000 | |
| organization.description | Example Journal (EXJ) is a monthly, online-only, open access, peer-reviewed journal promoting the rapid dissemination of newly developed, innovative tools and protocols in all areas of the plant sciences, including genetics, structure, function, development, evolution, systematics, and ecology. EXJ is a publication of The Example Journal Society of America. | May be provided by the journal during integration or obtained from web page |
| organization.website | http://www/home/publications/EXJ.html | Link to the website to access journal content |
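
To see how the sample canonicalManuscriptNumberPattern value behaves, the short Python sketch below applies it to a few MS Reference IDs. It assumes the pattern must match the whole ID and that the first capture group is the canonical manuscript number; the function name and the sample IDs are made up for illustration, and Python's `re` module stands in for the webapp's own regex handling.

```python
import re

# Sample pattern from the configuration above; the EXJ prefix is illustrative.
PATTERN = re.compile(r".*?(EXJ-D-\d+-\d+).*?")


def canonical_manuscript_number(ms_reference_id):
    """Return the captured manuscript number, or None if the ID does not match."""
    match = PATTERN.fullmatch(ms_reference_id)
    return match.group(1) if match else None


print(canonical_manuscript_number("EXJ-D-14-00123.R1"))      # version suffix dropped
print(canonical_manuscript_number("user42 EXJ-D-14-00123"))  # leading user id dropped
print(canonical_manuscript_number("ABC-123"))                # no match
```

The lazy `.*?` on either side lets the pattern skip a leading user id or a trailing version indicator, so different revisions of one manuscript resolve to the same number.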

Testing

Most testing can be done by looking at the journal-submit.log file on the server. After authorization has been completed as above, test the gmail connection: `curl http://localhost:9999/journal-submit/test`. You should get the following snippet in your log:

2015-11-17 20:13:44,926 INFO  org.datadryad.submission.DryadEmailSubmission @ got 1 test messagesMessage: 1503fbdddfe59413, The journal-submit webapp is working!
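
If you would rather check for this line programmatically than by eye, a small helper along the following lines could scan the log text. This is only a sketch: the function name is hypothetical, and it assumes the test message text is exactly the phrase shown in the snippet above.

```python
# Phrase taken from the sample log line; assumed to be stable across versions.
MARKER = "The journal-submit webapp is working!"


def find_test_messages(log_text):
    """Return every log line that contains the test message."""
    return [line for line in log_text.splitlines() if MARKER in line]
```

For example, `find_test_messages(open("journal-submit.log").read())` returns a non-empty list once the test request has been logged.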

Troubleshooting

If running http://localhost:9999/journal-submit/test returns errors, it is possible that the stored credential has gotten out of sync. Delete the credential file on the server (stored at /opt/dryad/submission/credential/StoredCredential) and reauthorize the webapp.

If running http://localhost:9999/journal-submit/retrieve returns an error saying that messages are still being processed, you can clear that by running http://localhost:9999/journal-submit/clear.


Monitoring (ongoing)

The journal-submit-app@datadryad.org email account should be checked at least once daily.

Checking Email Notifications for Journal Submit Errors

  1. Log in to the journal-submit-app@datadryad.org account in Gmail to view the email notifications being processed
  2. Check that the journal-submit folder does not contain more than a few emails. If it does, see the Troubleshooting section above.
  3. Check the journal-submit-error folder. If it contains emails, inspect each email to determine the possible reason for the error. Here are a few examples of common issues and how to resolve them:
    • Email is a congratulatory email sent as a “reply all” that includes the journal-submit email address - remove the journal-submit-error label
    • Email is intended for the journal and was sent using “reply-all” - if this does not apply to Dryad, remove the journal-submit-error label
    • Email contains a question that should be answered by Dryad - forward to the help desk
    • Email is an incomplete duplicate of a more detailed email - remove the journal-submit-error label
    • Email is in HTML or rich text and not in plain text - copy to new email and send to journal-submit email address
    • Email does not have one or more of the following six mandatory fields or the value for the field is left blank - contact the journal
      • Journal Code
      • Journal Name
      • MS Reference Number
      • MS Title
      • MS Authors
      • Article Status
  4. After inspecting the email notification and addressing any issues, it may be necessary to reprocess the email notification by changing its labels as described below. Multiple emails can be reprocessed at once, but reprocess no more than 100 at a time to ensure that the system works properly.
    1. Change the labels of any email to be reprocessed as follows
      1. If reprocessing a single email, while viewing the email, click on "Labels" to display the available labels
        OR
        If reprocessing multiple emails, from the journal-submit-error folder, click on the checkboxes to select the emails to reprocess and then click on "Labels" to display the available labels
      2. From the list of "Labels," remove the checkmark in front of journal-submit-error
      3. Add a checkmark in front of journal-submit
      4. Click on Apply
    2. Log in to the Dryad server to track the journal-submit.log to ensure the email processing webapp is processing emails by using the following command:
      tail -f journal-submit.log
    3. If emails do not seem to be processing correctly, see the Troubleshooting section shown above.
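
When inspecting an email from the error folder, the check for the six mandatory fields listed in step 3 can be approximated with the Python sketch below. It assumes each field appears on its own line as "Field Name: value" (see Journal Metadata for the authoritative format); the function name is hypothetical and this is not the webapp's own validation code.

```python
MANDATORY_FIELDS = [
    "Journal Code",
    "Journal Name",
    "MS Reference Number",
    "MS Title",
    "MS Authors",
    "Article Status",
]


def missing_mandatory_fields(body):
    """Return the mandatory fields that are absent or have a blank value."""
    present = {}
    for line in body.splitlines():
        name, sep, value = line.partition(":")
        if sep:
            present[name.strip()] = value.strip()
    return [name for name in MANDATORY_FIELDS if not present.get(name)]
```

An empty result means all six fields are present with values; otherwise the returned names are the ones to raise with the journal.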

Additional implementation notes

Relation to DSpace

The journal-submit is a separate webapp (DSpace module) but it is related to the standard DSpace submission process (which Dryad has heavily modified). It uses the same manuscript metadata classes as the Dryad REST API webapp.

Control flow

DryadEmailSubmission handles the retrieval of messages, based on a timer or when a retrieval is manually requested. When it receives a message, it processes it. Currently only MIME messages are processed by the webapp; HTML messages will be ignored.

The processor looks for content corresponding to the format specified in Journal Metadata. It uses the email's provided Journal Code or Journal Name to look for a matching journal concept in the authority control; if it finds one, it uses the specified parsingScheme to choose an email parser.

The parser will return a Manuscript object that contains the Dryad metadata. This Manuscript object can either be written out as an XML file or stored in the PostgreSQL database in the manuscript table.

Email parsers

The email parsers all derive from a basic parser that parses commonly-used fields found in the journal emails (see Journal Metadata for the proper formatting). Derived parser classes can modify the mapping of email fields to Dryad fields (by adding entries to fieldToXMLTagMap) or handle processing of journal-specific tags to better map email fields to Dryad fields (by overriding parseSpecificTags()).

The current derived classes handle specialized cases that differ from the base case:

  • EmailParserForManuscriptCentral - This is the most commonly used parser. Also handles fields specific to Genomic Resources Notes Technology (e.g. MS Citation Title, MS Citation Authors)
  • EmailParserForAmNat - AmNat has a large set of field tags that we don't process. This parser also uses variants in many of the field names and parses the corresponding author fields differently.
  • EmailParserForBmcEvoBio - This parser inherits from the ManuscriptCentral parser, with a few variant field names.
  • EmailParserForEcoApp - This parser includes a few variant field names and parses the corresponding author differently.
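
The two extension points described above (adding entries to the field map and overriding the journal-specific hook) can be illustrated with a Python analogue of the class hierarchy. The real parsers are Java classes; everything here, including the derived class name, the tag names, and the "Field Name: value" email layout, is an assumption made for illustration.

```python
class EmailParser:
    """Base parser: maps commonly used email fields to metadata tags."""

    # Python analogue of fieldToXMLTagMap; the tag names are hypothetical.
    field_to_xml_tag = {
        "Journal Name": "Journal",
        "MS Reference Number": "Manuscript",
        "MS Title": "Article_Title",
    }

    def parse(self, body):
        result = {}
        for line in body.splitlines():
            name, sep, value = line.partition(":")
            if not sep:
                continue  # not a "Field Name: value" line
            name, value = name.strip(), value.strip()
            tag = self.field_to_xml_tag.get(name)
            if tag is not None:
                result[tag] = value
            else:
                self.parse_specific_tags(name, value, result)
        return result

    def parse_specific_tags(self, name, value, result):
        """Hook for journal-specific fields; the base parser ignores them."""


class EmailParserForExampleJournal(EmailParser):
    """Hypothetical derived parser using both extension points."""

    # Extra mapping entry: a variant field name for the manuscript number.
    field_to_xml_tag = {**EmailParser.field_to_xml_tag, "Manuscript ID": "Manuscript"}

    def parse_specific_tags(self, name, value, result):
        if name == "MS Citation Title":
            result["Citation_Title"] = value
```

With this layout, a derived parser only states how it differs from the base case, which mirrors how the derived classes listed above are described.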

Automating the Review Workflow

The "in review" workflow stage is a holding ground for data submissions associated with manuscripts in review. When integrated journals send metadata emails to Dryad with an article status of "submitted," any subsequent data packages created using that manuscript ID are submitted to the review workflow stage.

Options for Content in the Review Workflow

Actions can be taken on items matching manuscript metadata when Dryad is notified of a change in status. 

  • Accepted: matching packages will be pushed into the curation workflow stage.
  • Published: matching packages will be pushed into the curation workflow stage. (TO BE IMPLEMENTED: they should be automatically archived.)
  • Rejected: matching packages will be returned to the submitters' workspace.
  • Needs Revision: matching packages will be returned to the submitters' workspace.

Another way of looking at this: there are two possible actions that can be taken. Items can be "approved" from the review workspace, which moves them to the curation workflow, or they can be "rejected," which returns them to the submitter's workspace.
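
The mapping from article status to one of these two actions can be written down as a small lookup table. The Python sketch below is illustrative only: the enum, the function name, and the lower-case status spellings are assumptions, not the webapp's actual identifiers.

```python
from enum import Enum


class ReviewAction(Enum):
    APPROVE = "move to the curation workflow"
    REJECT = "return to the submitter's workspace"


# Statuses from the list above, normalized to lower case (an assumption).
STATUS_ACTIONS = {
    "accepted": ReviewAction.APPROVE,
    "published": ReviewAction.APPROVE,
    "rejected": ReviewAction.REJECT,
    "needs revision": ReviewAction.REJECT,
}


def action_for_status(article_status):
    """Return the action for a status, or None if the status triggers none."""
    return STATUS_ACTIONS.get(article_status.strip().lower())
```

A status such as "submitted" is not in the table, so `action_for_status` returns `None` and no existing review package is moved.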

Finding Content in the Review Workflow

Items in the review workflow are identified by their workflow_id numbers. These may or may not correspond to information matching manuscript metadata that we receive from journals.

Dryad attempts to match every review package that corresponds to an article's metadata. The system examines all review packages for the given journal by the following criteria:

  • Dryad DOI: If this is provided, this is the easiest case. Even if no other metadata matches, review packages with the Dryad DOI will be considered a match.
  • Manuscript ID: If this is provided, all review packages for this journal that contain a matching dc.identifier.manuscriptNumber value will be considered matches.
  • Authors: Even if neither Dryad DOI nor manuscript ID is provided, Dryad will compare authors for all review packages for the given journal. If every author is matched for both the review package and the manuscript metadata, it is considered a match.

Note that all review packages for a journal that match any of these criteria will be affected. This is because submitters sometimes create multiple submissions that correspond to (different parts of) the same manuscript metadata.
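
The three criteria combine as a logical OR, which the following Python sketch makes explicit. The `Record` type, its field names, and the case-insensitive author comparison are assumptions for illustration; the actual matching code may normalize names differently.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Record:
    """Hypothetical view of either a review package or manuscript metadata."""
    authors: List[str] = field(default_factory=list)
    dryad_doi: Optional[str] = None
    manuscript_number: Optional[str] = None


def is_match(package, manuscript):
    """True if the package matches on DOI, manuscript number, or full author list."""
    if manuscript.dryad_doi and package.dryad_doi == manuscript.dryad_doi:
        return True
    if (manuscript.manuscript_number
            and package.manuscript_number == manuscript.manuscript_number):
        return True
    # Every author must match in both directions, so compare as sets.
    package_authors = {a.strip().lower() for a in package.authors}
    manuscript_authors = {a.strip().lower() for a in manuscript.authors}
    return bool(manuscript_authors) and package_authors == manuscript_authors


def matching_packages(packages, manuscript):
    """All of a journal's review packages that match; every one is affected."""
    return [p for p in packages if is_match(p, manuscript)]
```

Returning a list rather than the first hit reflects the note above: a submitter may have several packages tied to the same manuscript.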

Managing Content in the Review Workflow

The review workflow can be managed either manually or automatically.

Manually via the command line

./{dspace.dir}/bin/dspace review-item {-i workflow_id|-m manuscript_number} -a {true|false}

This command requires 2 parameters. The first parameter indicates the item, and can take one of two forms:

  • -i the id of the workflow item (workflow_item_id instead of item_id)
  • -m the manuscript number associated with the item

The second parameter indicates the status of the item:

  • -a whether or not the item has been approved
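
When the command is driven from a script, it can help to build the argument list in one place so that exactly one of -i or -m is always supplied. The Python sketch below only assembles the arguments shown above; the function name and the `dspace_dir` parameter are hypothetical, and nothing is executed.

```python
def review_item_args(dspace_dir, approved, workflow_id=None, manuscript_number=None):
    """Build the argument list for the review-item command described above."""
    if (workflow_id is None) == (manuscript_number is None):
        raise ValueError("give exactly one of workflow_id or manuscript_number")
    if workflow_id is not None:
        selector = ["-i", str(workflow_id)]
    else:
        selector = ["-m", manuscript_number]
    approval = "true" if approved else "false"
    return [f"{dspace_dir}/bin/dspace", "review-item", *selector, "-a", approval]
```

The returned list can be handed to `subprocess.run(args, check=True)` on the server, which avoids shell quoting problems with manuscript numbers that contain unusual characters.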

Automatically via the journal-submit webapp

Emails sent by journals to the journal-submit account with a status that corresponds to an action will be automatically matched to packages in the review workflow. Those matching packages will be pushed to curation or returned to submitter, as necessary.

Automatically via the REST API

See http://wiki.datadryad.org/index.php?title=Dryad_REST_API_Technology&section=23

See Also