Journal Metadata Processing Technology

= Overview =

The Dryad journal-submit webapp imports metadata that integrated journals email to Dryad. Each email is parsed for article publication metadata, which is then used to populate the submission form. The submitter sees this pre-filled form after arriving at Dryad through the journal-provided link, or after manually entering a manuscript number for which Dryad holds metadata.

Journal Email Workflow

 * 1) Journal sends email with article metadata to the journal-submit@datadryad.org email address, which forwards to journal-submit-app@datadryad.org
 * 2) Messages in journal-submit-app@datadryad.org are automatically labeled by Gmail with a special label.
 * 3) At a regular interval, the journal-submit webapp retrieves the newest emails with the special label, saves those messages for further processing, and removes the label. The specific label is set by the Maven settings on the server, so each Dryad server can react to a different label.
 * 4) The webapp processes the new emails: it gets the byte stream with the email content and detects the journal's name.
 * 5) The webapp then looks up the journal name in the journal authority control, to learn the `parsingScheme` to use. The value of this attribute is used to match on the parsing classes in the journal-submit's codebase.
 * 6) When a particular parser is determined to be the correct one to use, the webapp uses that to parse the remainder of the email, putting metadata values into a ParsingResult object.
 * 7) The ParsingResult class is then used to export the metadata into an XML file; the location of this file is determined by the `metadataDir` value from the journal authority control.
 * 8) Once this file is written (using the manuscript number as the name of the file), the journal-submit webapp is done with its process and control moves to the standard DSpace/Dryad submission process, which uses the written XML file to populate the submission form when the author comes to Dryad to complete the submission process.
 * 9) Later, a submitter will use the Submission System to retrieve the parsed metadata and initiate a new data submission.
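
Steps 4 through 8 can be sketched roughly as follows. This is a minimal illustration in Python, not the webapp's actual code: the `PARSERS` registry, the `parse_basic` function, the `journals` lookup table, and the XML element names are all assumptions made for the example. Only `parsingScheme`, `metadataDir`, and the use of the manuscript number as the file name come from the description above.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def parse_basic(body):
    """Hypothetical parser: collect 'Field: value' lines into a dict."""
    fields = {}
    for line in body.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip():
            fields[key.strip()] = value.strip()
    return fields


# Hypothetical registry keyed by the journal's parsingScheme value.
PARSERS = {"basic": parse_basic}


def process_email(body, journals):
    """Detect the journal, choose a parser, and write the XML file."""
    header = parse_basic(body)
    # Look up the journal's settings (stand-in for the authority control).
    journal = journals[header["Journal Name"]]
    parser = PARSERS[journal["parsingScheme"]]
    result = parser(body)

    # Export the parsed values; the element names here are illustrative only.
    root = ET.Element("manuscript")
    for key, value in result.items():
        ET.SubElement(root, "field", name=key).text = value

    # The file is named after the manuscript number, inside metadataDir.
    path = os.path.join(journal["metadataDir"],
                        result["MS Reference Number"] + ".xml")
    ET.ElementTree(root).write(path, encoding="utf-8", xml_declaration=True)
    return path
```

Keying the parser registry on `parsingScheme` keeps journal-specific behavior in configuration, so a journal can be switched to another parser without a code change.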

If an email fails to parse because it is malformed (see Journal Metadata for proper email formatting), it is tagged with the server's Gmail error label.

Users with access to the journal-submit-app Gmail account can also manually add and remove the Gmail labels to re-process specific emails. To trigger the journal-submit webapp manually, use the URL http://whatever.server/journal-submit/retrieve.

Overview
There are two components to the workflow: the journal-submit webapp, which runs on the server, and the Gmail account that the webapp listens to. Both can be tweaked separately.

Configuring the journal-submit webapp
Several settings need to be in the server's Maven settings.xml file. The following are example settings for the main production server. Please choose unique settings for your own server's label and error label, e.g. dev-journal-submit and dev-journal-submit-error for the dev server.

The JSON data for the clientsecret can be found in the Maven settings on dev or production, or can be obtained directly from Google via https://console.developers.google.com/project/journal-submit/apiui/credential and is the downloaded JSON for "Client ID for native application."

journal-submit-app@datadryad.org

journal-submit

journal-submit-error

600

{JSON GOES HERE}

Once the settings have been updated, rebuild and deploy your server.

Authorizing the webapp
This should only need to be done when a server is deploying the webapp for the very first time: the credentials should remain authorized unless and until someone revokes the access through the Google Developer Console.

Once the webapp is configured and built, start the Tomcat instance and authorize the webapp to access the Gmail account:

Go to a web browser and navigate to http://localhost:9999/journal-submit/authorize (or whatever address the Tomcat server is running at) and follow the OAuth2 instructions.

When you are provided with an auth code, copy it and make a call to http://localhost:9999/journal-submit/authorize?code=whateverthecodeis to authorize the webapp.
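
If this call is scripted rather than pasted into a browser, the code should be URL-encoded, because authorization codes may contain characters such as `/`. The following is a small sketch that assumes the base address used in the examples above; `authorize_url` is a hypothetical helper name, not part of the webapp.

```python
from urllib.parse import urlencode


def authorize_url(code, base="http://localhost:9999/journal-submit"):
    """Build the authorize URL with the auth code safely encoded."""
    return base + "/authorize?" + urlencode({"code": code})
```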

Restart Tomcat and request http://localhost:9999/journal-submit/test; a test message should appear in your journal-submit.log file.

Configuring the Gmail labels
The labels that the webapp looks for need to be configured in the Gmail settings. Log into the Gmail web site as the journal-submit-app@datadryad.org account. Create labels for the server you're configuring that match the `<default.submit.journal.label>` and `<default.submit.journal.error.label>` settings in the Maven settings file (as above). If you want to listen for all incoming emails, set a filter to label all incoming messages with the label. The webapp will remove that label as it processes the labeled emails. Do not use an existing label for a new server instance, or else the new server's webapp will remove some other server's labels!

Configuring the journals
Journals are configured via journal concepts. Concepts can be created/updated from the Dryad website (Profile → Manage Journal Settings) or from the command line by updating JSON files and using curl to post the files to a REST API. For more information about how to create/update concepts, see Journal Concepts.

A sample configuration from the journal authority control is shown below, along with descriptions of the fields.

Testing the Metadata Processing
Most testing can be done by looking at the journal-submit.log file on the server. After authorization has been completed as above, test the Gmail connection: `curl http://localhost:9999/journal-submit/test`. You should get the following snippet in your log:

 2015-11-17 20:13:44,926 INFO org.datadryad.submission.DryadEmailSubmission @ got 1 test messages
 Message: 1503fbdddfe59413, The journal-submit webapp is working!
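
For servers that are checked often, the log inspection can be scripted. The sketch below assumes only the "got N test messages" wording from the log snippet above; the function name is hypothetical, and reading the log file from disk is left to the caller.

```python
import re

# Matches the count in log entries such as "got 1 test messages".
TEST_LINE = re.compile(r"got (\d+) test messages")


def count_test_messages(log_text):
    """Return the count from the last matching log entry, or None."""
    matches = TEST_LINE.findall(log_text)
    return int(matches[-1]) if matches else None
```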

Testing the end-to-end Workflow
Testing the journal/email workflow requires coordination of a Gmail account, a DSpace-based server (for email processing), and a Dash-based server (for data submissions). Set up the servers:


 * On the target Dryad/Dash server:
 ** Ensure it has the correct authentication for the Dryad/DSpace server. This is in app_config.yml, under old_dryad_url and old_dryad_access_token.
 * On the Dryad/DSpace server (dryad-dspace-server) that will process emails:
 ** Ensure it has a journal concept set up with the journal name and settings that you want to test.
 ** Ensure it has authentication information for the correct email account. This account is usually journal-submit-app@datadryad.org for all servers, so it shouldn't require a change. The setting is submit.journal.clientsecrets in dspace.cfg.
 ** Check which email label it will look for. This is submit.journal.email.label in dspace.cfg.
 ** Ensure it has the correct authentication for the Dryad/Dash server. This is dash.server and the associated auth key in dspace.cfg.

To run the test:
 * Send an email to the Gmail account with the required metadata. The account can be journal-submit-app OR the more general journal-submit@datadryad.org. Note that you must use a "real" journal name to match the settings in the concept above. Something like:
   Journal Name: Molecular Ecology
   MS Reference Number: abc123
   Article Status: submitted
   MS Title: Some great and unique title
   MS Authors: Author Authorious
   Contact Author: Author Authorious
   Contact Author Email: fake-author@datadryad.org
   Keywords: great research, stellar data
 * In the Gmail account (journal-submit-app@datadryad.org), apply the proper label to the email
 * Either wait, or force the email to be processed by accessing http://dryad-dspace-server.datadryad.org/journal-submit/retrieve
 * View the logs on the Dryad/V1 server to ensure the email was processed correctly.
 * On the Dryad/Dash server, create a submission with the associated journal name and manuscript number. It should correctly import the metadata and use the journal settings to determine whether peer review is allowed and whether to charge the user. Normally, you will want to put the submission into peer_review status to test the subsequent steps.
 * Either wait for Merritt processing of the Dryad/Dash submission, or force the notifier to run (notifier_force.sh on most servers)
 * Send another email to journal-submit@datadryad.org to update the status of the submission. This can be exactly the same message as above, just with the Article Status changed to either "accepted" or "rejected".
 * Again, apply the correct label to the email.
 * Again, wait or force the email to be processed.
 * View the results in server logs and the status of the Dryad/Dash submission.
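
Because the follow-up email is the same message with only the Article Status changed, the test messages can be generated from one template. This sketch uses Python's standard `email.message.EmailMessage`; the sender address and subject line are placeholders chosen for the example, and actually sending the message (for instance through an SMTP server) is deliberately left out.

```python
from email.message import EmailMessage

# Field values taken from the sample test message above.
FIELDS = [
    ("Journal Name", "Molecular Ecology"),
    ("MS Reference Number", "abc123"),
    ("Article Status", "submitted"),
    ("MS Title", "Some great and unique title"),
    ("MS Authors", "Author Authorious"),
    ("Contact Author", "Author Authorious"),
    ("Contact Author Email", "fake-author@datadryad.org"),
    ("Keywords", "great research, stellar data"),
]


def build_test_email(status, sender="tester@example.org"):
    """Return a plain-text test message with the given Article Status."""
    msg = EmailMessage()
    msg["To"] = "journal-submit@datadryad.org"
    msg["From"] = sender  # placeholder sender, an assumption
    msg["Subject"] = "Test metadata (%s)" % status  # placeholder subject
    lines = []
    for name, value in FIELDS:
        if name == "Article Status":
            value = status
        lines.append("%s: %s" % (name, value))
    # set_content with a str produces a text/plain body, which matters
    # because HTML messages are not processed by the webapp.
    msg.set_content("\n".join(lines))
    return msg
```

Calling `build_test_email("submitted")` and later `build_test_email("accepted")` yields the two messages used in the steps above.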

Troubleshooting
If running http://localhost:9999/journal-submit/test returns errors, it is possible that the stored credential has gotten out of sync. Delete the credential file on the server (stored at /opt/dryad/submission/credential/StoredCredential) and reauthorize the webapp.

If running http://localhost:9999/journal-submit/retrieve returns an error saying that messages are still being processed, you can clear that by running http://localhost:9999/journal-submit/clear.

Monitoring (ongoing)
The journal-submit-webapp email account should be checked at least once daily.

Checking Email Notifications for Journal Submit Errors
 * 1) Open the journal-submit webapp email account in Gmail and log in to view the email notifications being processed.
 * 2) Check that the journal-submit folder does not hold more than a few emails. If it does, go to the Troubleshooting section shown above.
 * 3) Check the journal-submit-error folder. If it contains emails, inspect each email to determine the possible reason for the error. Here are a few examples of common issues and how to resolve them:
 ** Email is a congratulatory email sent as a “reply all” that includes the journal-submit email address - remove the journal-submit-error label
 ** Email is intended for the journal and was sent using “reply-all” - if this does not apply to Dryad, remove the journal-submit-error label
 ** Email contains a question that should be answered by Dryad - forward to the help desk
 ** Email is an incomplete duplicate of a more detailed email - remove the journal-submit-error label
 ** Email is in HTML or rich text and not in plain text - copy to new email and send to journal-submit email address
 ** Email does not have one or more of the following six mandatory fields, or the value for a field is left blank - contact the journal
 *** Journal Code
 *** Journal Name
 *** MS Reference Number
 *** MS Title
 *** MS Authors
 *** Article Status
 * 4) After inspecting the email notification and addressing any issues, it may be necessary to reprocess the email notification. This is done by changing the labels as described in the next step. (Note: multiple emails can be processed at once. If you opt to process multiple emails at once, you should do no more than 100 at one time to ensure that the system works properly.)
 * 5) Change the labels of any email to be reprocessed as follows:
 ** If reprocessing a single email, while viewing the email, click on "Labels" to display the available labels
 ** OR, if reprocessing multiple emails, from the journal-submit-error folder, click on the checkboxes to select the emails to reprocess and then click on "Labels" to display the available labels
 ** 1) From the list of "Labels," remove the checkmark in front of journal-submit-error
 ** 2) Add a checkmark in front of journal-submit
 ** 3) Click on Apply
 * 6) Log in to the Dryad server to track the journal-submit.log to ensure the email processing webapp is processing emails by using the following command:
 * 7) If emails do not seem to be processing correctly, see the Troubleshooting section shown above.
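
When triaging the error folder, the mandatory-field check can be reproduced outside the webapp. The sketch below follows the list of six fields above literally and treats a blank value as missing. It is an illustration only: the function name is hypothetical, and the webapp's own validation may differ.

```python
MANDATORY_FIELDS = [
    "Journal Code",
    "Journal Name",
    "MS Reference Number",
    "MS Title",
    "MS Authors",
    "Article Status",
]


def missing_mandatory_fields(body):
    """Return the mandatory fields that are absent or blank in the body."""
    found = {}
    for line in body.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            found[key.strip()] = value.strip()
    # A field counts as missing when it is not present or has no value.
    return [name for name in MANDATORY_FIELDS if not found.get(name)]
```

An empty result means the plain-text body carries all six fields, so the cause of the error lies elsewhere (for example, an HTML-only message).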

= Additional implementation notes =

Relation to DSpace
The journal-submit is a separate webapp (DSpace module) but it is related to the standard DSpace submission process (which Dryad has heavily modified). It uses the same manuscript metadata classes as the Dryad REST API webapp.

Control flow
DryadEmailSubmission handles the retrieval of messages, based on a timer or when a retrieval is manually requested. When it receives a message, it processes it. Currently only MIME messages are processed by the webapp; HTML messages will be ignored.

The processor looks for content corresponding to the format specified in Journal Metadata. It uses the email's provided Journal Code or Journal Name to look for a matching journal concept in the authority control; if it finds one, it uses the specified parsingScheme to choose an email parser.

The parser will return a Manuscript object that contains the Dryad metadata. This Manuscript object can either be written out as an XML file or stored in a table in the Postgres database.

Email parsers
The email parsers all derive from a basic parser that parses commonly-used fields found in the journal emails (see Journal Metadata for the proper formatting). Derived parser classes can modify the mapping of email fields to Dryad fields, or handle the processing of journal-specific tags to better map email fields to Dryad fields.

The current derived classes handle specialized cases that differ from the base case:


 * EmailParserForManuscriptCentral - This is the most commonly used parser. Also handles fields specific to Genomic Resources Notes Technology (e.g. MS Citation Title, MS Citation Authors)
 * EmailParserForAmNat - AmNat has a large set of field tags that we don't process. This parser also uses variants in many of the field names and parses the corresponding author fields differently.
 * EmailParserForBmcEvoBio - This parser inherits from the ManuscriptCentral parser, with a few variant field names.
 * EmailParserForEcoApp - This parser includes a few variant field names and parses the corresponding author differently.
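
The relationship between the basic parser and its derived classes can be pictured with a small sketch. The class names, the `field_map` attribute, the `convert` hook, and the Dryad field names below are all hypothetical, and the example is written in Python for brevity rather than mirroring the real classes. It shows only the two extension points described above: adding field-name variants and overriding how a particular field is processed.

```python
class BaseEmailParser:
    # Hypothetical mapping from email field labels to Dryad field names.
    field_map = {
        "MS Reference Number": "manuscriptNumber",
        "MS Title": "title",
        "MS Authors": "authors",
    }

    def parse(self, body):
        """Map recognized 'Label: value' lines to Dryad fields."""
        result = {}
        for line in body.splitlines():
            label, sep, value = line.partition(":")
            field = self.field_map.get(label.strip())
            if sep and field:
                result[field] = self.convert(field, value.strip())
        return result

    def convert(self, field, value):
        """Hook for field-specific processing; the base keeps the text."""
        return value


class VariantParser(BaseEmailParser):
    # Extension point 1: add a variant label for an existing field.
    field_map = {**BaseEmailParser.field_map, "Manuscript Title": "title"}

    # Extension point 2: process one field differently.
    def convert(self, field, value):
        if field == "authors":
            return [name.strip() for name in value.split(";")]
        return value
```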

= Automating the Review Workflow =

The "in review" workflow stage is a holding ground for data submissions associated with manuscripts in review. When integrated journals send metadata emails to Dryad with an article status of "submitted," any subsequent data packages created using that manuscript ID are submitted to the review workflow stage.

Options for Content in the Review Workflow
Actions can be taken on items matching manuscript metadata when Dryad is notified of a change in status.


 * Accepted: matching packages will be pushed into the curation workflow stage.
 * Published: matching packages will be pushed into the curation workflow stage. (TO BE IMPLEMENTED: they should be automatically archived.)
 * Rejected: matching packages will be returned to the submitters' workspace.
 * Needs Revision: matching packages will be returned to the submitters' workspace.

Another way of looking at this: there are two possible actions that can be taken. Items can be "approved" from the review workspace, which moves them to the curation workflow, or they can be "rejected," which returns them to the submitter's workspace.
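
The four statuses therefore reduce to two actions, which can be written as a lookup table. This is a sketch only: the lower-case status spellings and the action names are assumptions for the example, not the identifiers used in the codebase.

```python
# Statuses that trigger an action on matching review packages.
ACTIONS = {
    "accepted": "approve",        # pushed into the curation workflow
    "published": "approve",       # pushed into the curation workflow
    "rejected": "reject",         # returned to the submitter's workspace
    "needs revision": "reject",   # returned to the submitter's workspace
}


def action_for_status(status):
    """Return 'approve', 'reject', or None when no action applies."""
    return ACTIONS.get(status.strip().lower())
```

A status such as "submitted" is not in the table, so it yields `None` and leaves packages in review.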

Finding Content in the Review Workflow
Items in the review workflow are identified by their workflow_id numbers. These may or may not correspond to information matching manuscript metadata that we receive from journals.

Dryad attempts to match every review package that corresponds to an article's metadata. The system examines all review packages for the given journal by the following criteria:


 * Dryad DOI: If this is provided, this is the easiest case. Even if no other metadata matches, review packages with the Dryad DOI will be considered a match.
 * Manuscript ID: If this is provided, all review packages for this journal that contain a matching dc.identifier.manuscriptNumber value will be considered matches.
 * Authors: Even if neither a Dryad DOI nor a manuscript ID is provided, Dryad will compare authors for all review packages for the given journal. If every author is matched for both the review package and the manuscript metadata, it is considered a match.

Note that all review packages for a journal that match any of these criteria will be affected. This is because submitters sometimes create multiple submissions that correspond to (different parts of) the same manuscript metadata.
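
The three criteria can be summarized in a short sketch. The dictionary keys (`dryad_doi`, `manuscript_id`, `authors`) are names invented for this example, and the author comparison is simplified to exact set equality; the real matching may normalize names before comparing them.

```python
def matches(package, manuscript):
    """Return True if a review package matches the manuscript metadata."""
    # 1. Dryad DOI: sufficient on its own when provided.
    doi = manuscript.get("dryad_doi")
    if doi and package.get("dryad_doi") == doi:
        return True
    # 2. Manuscript ID: compared with the package's manuscript number.
    ms_id = manuscript.get("manuscript_id")
    if ms_id and package.get("manuscript_id") == ms_id:
        return True
    # 3. Authors: every author must match in both directions.
    pkg_authors = set(package.get("authors", []))
    ms_authors = set(manuscript.get("authors", []))
    return bool(pkg_authors) and pkg_authors == ms_authors


def find_matches(packages, manuscript):
    """Return every package for the journal that meets any criterion."""
    return [p for p in packages if matches(p, manuscript)]
```

`find_matches` returns a list rather than a single package because, as noted above, one manuscript may correspond to several submissions.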

Managing Content in the Review Workflow
The review workflow can be managed either manually or automatically.

Manually via the command line
 ./{dspace.dir}/bin/dspace review-item {-i workflow_id|-m manuscript_number} -a {true|false}

This command requires 2 parameters. The first parameter indicates the item, and can take one of two forms:


 * -i the id of the workflow item (workflow_item_id instead of item_id)
 * -m the manuscript number associated with the item

The second parameter indicates the status of the item:


 * -a whether or not the item has been approved
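
When the command is issued from a script, the argument list can be assembled as below. This sketch only builds the arguments using the flags described above and does not run anything. The function name is hypothetical, and the DSpace directory is supplied by the caller.

```python
def review_item_args(dspace_dir, approved, workflow_id=None,
                     manuscript_number=None):
    """Build the argument list for the review-item command."""
    # Exactly one of the two item selectors must be given.
    if (workflow_id is None) == (manuscript_number is None):
        raise ValueError("give exactly one of workflow_id or manuscript_number")
    if workflow_id is not None:
        selector = ["-i", str(workflow_id)]
    else:
        selector = ["-m", manuscript_number]
    return [dspace_dir + "/bin/dspace", "review-item", *selector,
            "-a", "true" if approved else "false"]
```

The resulting list can be passed to a process launcher such as `subprocess.run` without shell quoting concerns.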

Automatically via the journal-submit webapp
Emails sent by journals to the journal-submit account with a status that corresponds to an action will be automatically matched to packages in the review workflow. Those matching packages will be pushed to curation or returned to submitter, as necessary.

Automatically via the REST API
See http://wiki.datadryad.org/index.php?title=Dryad_REST_API_Technology&section=23