Difference between revisions of "Journal Metadata Processing Technology"

From Dryad wiki

Revision as of 09:28, 25 October 2019

Overview

The Dryad journal-submit webapp imports metadata that integrated journals email to Dryad. Each email is parsed for article publication metadata, which is then used to populate the form the submitter sees on arriving at Dryad, whether through the journal-provided link or by manually entering a manuscript number for which Dryad has metadata.

Journal Email Workflow

  1. Journal sends email with article metadata to the journal-submit@datadryad.org email address, which forwards to journal-submit-app@datadryad.org
  2. Messages in journal-submit-app@datadryad.org are automatically labeled by Gmail with a special label.
  3. At a regular interval, the journal-submit webapp retrieves the newest emails with the special label, saving those messages to process further and removing the label. The specific label is set in the Maven settings on the server, so each Dryad server can react to a different label.
  4. The webapp processes the new emails: it gets the byte stream with the email content and detects the journal's name.
  5. The webapp then looks up the journal name in the journal authority control to learn which `parsingScheme` to use. The value of this attribute is matched against the parser classes in the journal-submit codebase.
  6. When a particular parser is determined to be the correct one to use, the webapp uses that to parse the remainder of the email, putting metadata values into a ParsingResult object.
  7. The ParsingResult class is then used to export the metadata into an XML file; the location of this file is determined by the `metadataDir` value from the journal authority control.
  8. Once this file is written (using the manuscript number as the name of the file), the journal-submit webapp is done with its process and control moves to the standard DSpace/Dryad submission process, which uses the written XML file to populate the submission form when the author comes to Dryad to complete the submission process.
  9. Later, a submitter will use the Submission System to retrieve the parsed metadata and initiate a new data submission.

If an email fails to parse properly because it is malformed (see Journal Metadata for proper email formatting), it will be tagged with the server's Gmail error label.

Users with access to the journal-submit-app Gmail account can also manually add and remove the Gmail labels to re-process specific emails. To trigger the journal-submit webapp manually, use the URL http://whatever.server/journal-submit/retrieve.
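Steps 4 through 8 of the workflow above can be sketched roughly as follows. This is a minimal Python illustration, not the webapp's actual Java code: the `concepts` dictionary stands in for the journal authority control, the helper names and the XML element names are hypothetical, and only the "Field: value" email layout, `parsingScheme`, and `metadataDir` come from the description above.

```python
import xml.etree.ElementTree as ET
from pathlib import Path


def parse_fields(body):
    """Split 'Field: value' lines into a dict (a much-simplified parser)."""
    fields = {}
    for line in body.splitlines():
        name, sep, value = line.partition(":")
        if sep and name.strip():
            fields[name.strip()] = value.strip()
    return fields


def write_metadata(fields, metadata_dir):
    """Write the fields to <metadata_dir>/<manuscript number>.xml."""
    root = ET.Element("manuscript")  # element names are illustrative only
    for name, value in fields.items():
        ET.SubElement(root, "field", name=name).text = value
    out = Path(metadata_dir) / f"{fields['MS Reference Number']}.xml"
    ET.ElementTree(root).write(str(out), encoding="utf-8", xml_declaration=True)
    return out


def process_email(body, concepts):
    """Look up the journal concept, then export the parsed metadata."""
    fields = parse_fields(body)
    concept = concepts[fields["Journal Name"]]  # KeyError for an unknown journal
    # The real webapp picks a parser class from this value; here it is only returned.
    scheme = concept["parsingScheme"]
    return scheme, write_metadata(fields, concept["metadataDir"])
```

Calling `process_email` with a body containing "Journal Name: Molecular Ecology" and "MS Reference Number: abc123" would write `abc123.xml` into whatever directory the matching concept names.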

Configuration

Overview

There are two components to the workflow: the journal-submit webapp, which runs on the server, and the Gmail account that the webapp listens to. Both can be tweaked separately.

Configuring the journal-submit webapp

Several settings need to be in the server's Maven settings.xml file. The following are example settings for the main production server. Please choose unique settings for your own server's label and error label, e.g. dev-journal-submit and dev-journal-submit-error for the dev server.

The JSON data for the clientsecret can be found in the Maven settings on dev or production, or can be obtained directly from Google via https://console.developers.google.com/project/journal-submit/apiui/credential and is the downloaded JSON for "Client ID for native application." 


<!-- Journal-submit webapp -->
<!-- the gmail account to listen to: -->
<default.submit.journal.email>journal-submit-app@datadryad.org</default.submit.journal.email>

<!-- the gmail label to look for: -->
<default.submit.journal.label>journal-submit</default.submit.journal.label>

<!-- the gmail label applied to messages that fail to parse properly: -->
<default.submit.journal.error.label>journal-submit-error</default.submit.journal.error.label>

<!-- how frequently the webapp should retrieve messages from gmail, in seconds: -->
<default.submit.journal.email.timer>600</default.submit.journal.email.timer>

<!-- clientsecret for authorization of the gmail account. -->
<!-- The JSON can be obtained at https://console.developers.google.com/project/journal-submit/apiui/credential -->
<!-- and is the downloaded JSON for "Client ID for native application." -->
<default.submit.journal.clientsecret>{JSON GOES HERE}</default.submit.journal.clientsecret>

Once the settings have been updated, rebuild and deploy your server.
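Before rebuilding, it can help to sanity-check the label settings. The following is a minimal Python sketch, assuming the entries sit inside a `<properties>` element (an assumption about the layout of settings.xml); the function names are hypothetical and the dev label values are the examples suggested above.

```python
import xml.etree.ElementTree as ET

# Sample fragment; the <properties> wrapper is an assumption.
SETTINGS = """
<properties>
  <default.submit.journal.email>journal-submit-app@datadryad.org</default.submit.journal.email>
  <default.submit.journal.label>dev-journal-submit</default.submit.journal.label>
  <default.submit.journal.error.label>dev-journal-submit-error</default.submit.journal.error.label>
  <default.submit.journal.email.timer>600</default.submit.journal.email.timer>
</properties>
"""

PREFIX = "default.submit.journal."


def read_journal_settings(xml_text):
    """Return the journal-submit settings keyed by their name after PREFIX."""
    root = ET.fromstring(xml_text.strip())
    return {el.tag[len(PREFIX):]: (el.text or "").strip()
            for el in root if el.tag.startswith(PREFIX)}


def check_settings(settings, labels_in_use=()):
    """List problems: identical labels, labels reused from another server, bad timer."""
    problems = []
    if settings.get("label") == settings.get("error.label"):
        problems.append("label and error label must differ")
    for key in ("label", "error.label"):
        if settings.get(key) in labels_in_use:
            problems.append(f"{key} '{settings[key]}' is already used by another server")
    if not settings.get("email.timer", "").isdigit():
        problems.append("email.timer must be a whole number of seconds")
    return problems
```

Passing the production labels as `labels_in_use` flags the case the text warns about, where a new server would consume another server's labels.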

Authorizing the webapp

This should only need to be done when a server is deploying the webapp for the very first time: the credentials should remain authorized unless and until someone revokes the access through the Google Developer Console.

Once the webapp is configured, built, and deployed, start the Tomcat instance and authorize the webapp to access the Gmail account:

Go to a web browser and navigate to http://localhost:9999/journal-submit/authorize (or whatever address the Tomcat server is running at) and follow the OAuth2 instructions.

When you are provided with an auth code, copy it and make a call to http://localhost:9999/journal-submit/authorize?code=whateverthecodeis to authorize the webapp.

Restart Tomcat and request http://localhost:9999/journal-submit/test; a test message should then appear in your journal-submit.log file.

Configuring the Gmail labels

The labels the webapp looks for must be created in the Gmail settings. Log into the Gmail web site as the journal-submit-app@datadryad.org account. Create labels for the server you're configuring that match the <default.submit.journal.label> and <default.submit.journal.error.label> settings in the Maven settings file (as above). If you want to listen for all incoming emails, set a filter to apply the label to all incoming messages. The webapp will remove that label as it processes the labeled emails. Do not reuse an existing label for a new server instance, or else the new server's webapp will remove some other server's labels!

Configuring the journals

Journals are configured via journal concepts. Concepts can be created or updated from the Dryad website (Profile → Manage Journal Settings) or from the command line by editing JSON files and posting them to a REST API with curl. For more information about how to create or update concepts, see Journal Concepts.

A sample configuration from the journal authority control is shown below, along with descriptions of the fields.

Field Name | Sample Value | Description
dc.description.provenance | | Manually updated to track changes to the concept.
journal.journalID | EXJ | Provided by the journal in the PIQ
journal.metadataDir | /opt/dryad/submission/journalMetadata/EXJ | Constructed using the directory '/opt/dryad/submission' plus the journal ID
journal.parsingScheme | manuscriptCentral | Provided by the journal in the PIQ
journal.integrated | true | Set to false until integration goes live, then set to true
journal.allowReviewWorkflow | false | Provided by the journal in the PIQ
journal.embargoAllowed | true | Provided by the journal in the PIQ
journal.publicationBlackout | true | Provided by the journal in the PIQ
journal.sponsorName | The Example Journal Society of America | Provided by the journal in the PIQ
journal.notifyOnReview | someone@somewhere.com, automated-messages@datadryad.org | Provided by the journal in the PIQ
journal.notifyOnArchive | someone-else@somewhere.com, automated-messages@datadryad.org | Provided by the journal in the PIQ
journal.notifyWeekly | and-yet-another@somewhere.com, automated-messages@datadryad.org | Provided by the journal in the PIQ
canonicalManuscriptNumberPattern | .*?(EXJ-D-\d+-\d+).*? | This is a regular expression (regex) that defines the search pattern used by the web app to parse the MS Reference ID field in journal notifications. Typically, the MS Reference ID contains a group of alphabetic characters followed by a group (or groups) of numeric characters. Sometimes the groups of characters are separated by a dash or a slash. Some journals add version indicators (for example, .R1, .R2 or A, B, C) at the end. A few journals add a user id at the beginning of the MS Reference ID for security reasons. A helpful tool for developing and testing the regular expression can be found at https://regex101.com/.
journal.issn | 1234-5678 | Provided by the journal in the PIQ
journal.coverImage | /themes/Dryad/images/coverimages/EXJ.png | Request from journal during integration set up
journal.hasJournalPage | true | Should be set to true for all integrated journals
organization.fullName | Example Journal | Provided by the journal in the PIQ
organization.paymentPlanType | SUBSCRIPTION | DEFERRED or SUBSCRIPTION or PREPAID
organization.customerID | 0000000 |
organization.description | Example Journal (EXJ) is a monthly, online-only, open access, peer-reviewed journal promoting the rapid dissemination of newly developed, innovative tools and protocols in all areas of the plant sciences, including genetics, structure, function, development, evolution, systematics, and ecology. EXJ is a publication of The Example Journal Society of America. | May be provided by the journal during integration or obtained from web page
organization.website | http://www/home/publications/EXJ.html | Link to the website to access journal content
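The behaviour of the sample canonicalManuscriptNumberPattern can be tried out locally. The sketch below uses Python's `re` module rather than the webapp's Java regex engine (the two agree for this simple pattern); the manuscript IDs are made up, and applying the pattern to the whole field with `fullmatch` is an assumption about how the webapp uses it.

```python
import re

# The sample pattern from the concept above.
PATTERN = re.compile(r".*?(EXJ-D-\d+-\d+).*?")


def canonical_ms_number(ms_reference_id):
    """Return the canonical manuscript number, or None if the ID does not match."""
    match = PATTERN.fullmatch(ms_reference_id.strip())
    return match.group(1) if match else None


# Version suffixes and user-id prefixes are stripped by the capture group.
for raw in ("EXJ-D-15-00123", "EXJ-D-15-00123.R1", "user42/EXJ-D-15-00123", "XYZ-1"):
    print(raw, "->", canonical_ms_number(raw))
```

The first three IDs all reduce to `EXJ-D-15-00123`, while `XYZ-1` does not match and yields `None`.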

Testing the Metadata Processing

Most testing can be done by looking at the journal-submit.log file on the server. After authorization has been completed as above, test the Gmail connection: `curl http://localhost:9999/journal-submit/test`. You should get the following snippet in your log:

2015-11-17 20:13:44,926 INFO  org.datadryad.submission.DryadEmailSubmission @ got 1 test messagesMessage: 1503fbdddfe59413, The journal-submit webapp is working!

Testing the end-to-end Workflow

Testing the journal/email workflow requires coordination of several different servers and settings. Set up the servers:

  • On the target Dryad/Dash server, ####
    • Ensure it has the correct authentication for the Dryad/DSpace server. This is in app_config.yml, old_dryad_url and old_dryad_access_token
  • On the Dryad/DSpace server (dryad-dspace-server) that will process emails,
    • Ensure it has authentication for the correct email account. This is in dspace.cfg, submit.journal.clientsecrets
    • Check which email label it will look for. This is in dspace.cfg, submit.journal.email.label
    • Ensure it has the correct authentication for the Dryad/Dash server. This is in dspace.cfg, dash.server and associated auth key.

To run the test:

  • Send an email to journal-submit@datadryad.org with the required metadata. Note that you must use a "real" journal name, and the Dryad/DSpace server must have the appropriate settings in the associated journal concept. Something like:
Journal Name: Molecular Ecology
MS Reference Number: abc123
Article Status: submitted
MS Title: Some great and unique title
MS Authors: Author Authorious
Contact Author: Author Authorious
Contact Author Email: fake-author@datadryad.org
Keywords: great research, stellar data  
  • In the Gmail account journal-submit-app@datadryad.org, apply the proper label to the email.
  • Either wait, or force the email to be processed by accessing http://dryad-dspace-server.datadryad.org/journal-submit/retrieve.
  • View the logs on the Dryad/V1 server to ensure the email was processed correctly.
  • On the Dryad/Dash server, create a submission with the associated journal name and manuscript number. It should correctly import the metadata and use the journal settings to determine whether peer review is allowed and whether to charge the user. Normally, you will want to put the submission into peer_review status to test the subsequent steps.
  • Either wait for Merritt processing of the Dryad/Dash submission, or force the notifier to run (notifier_force.sh on most servers)
  • Send another email to journal-submit@datadryad.org to update the status of the submission. This can be exactly the same message as above, just with the Article Status changed to either "accepted" or "rejected".
  • Again, apply the correct label to the email.
  • Again, wait or force the email to be processed.
  • View the results in server logs and the status of the Dryad/Dash submission.
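The two test messages (the initial "submitted" email and the later status update) differ only in the Article Status line, so they can be generated from one template. This is a minimal Python sketch using the standard `email` package; the function name and subject line are hypothetical, and actually sending the message (for example with `smtplib`) depends on a mail relay that is not described here.

```python
from email.message import EmailMessage


def build_test_email(status="submitted", sender="fake-author@datadryad.org"):
    """Build a plain-text metadata email like the sample above."""
    fields = [
        ("Journal Name", "Molecular Ecology"),
        ("MS Reference Number", "abc123"),
        ("Article Status", status),
        ("MS Title", "Some great and unique title"),
        ("MS Authors", "Author Authorious"),
        ("Contact Author", "Author Authorious"),
        ("Contact Author Email", "fake-author@datadryad.org"),
        ("Keywords", "great research, stellar data"),
    ]
    msg = EmailMessage()
    msg["To"] = "journal-submit@datadryad.org"
    msg["From"] = sender
    msg["Subject"] = f"Test metadata ({status})"  # subject text is arbitrary
    # set_content with a str produces a text/plain body.
    msg.set_content("\n".join(f"{name}: {value}" for name, value in fields) + "\n")
    return msg


first = build_test_email()             # initial notification
update = build_test_email("accepted")  # later status change, same manuscript
```

Keeping the manuscript number identical in both messages is what lets the second email be matched to the submission created in between.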

Troubleshooting

If running http://localhost:9999/journal-submit/test returns errors, it is possible that the stored credential has gotten out of sync. Delete the credential file on the server (stored at /opt/dryad/submission/credential/StoredCredential) and reauthorize the webapp.

If running http://localhost:9999/journal-submit/retrieve returns an error saying that messages are still being processed, you can clear that by running http://localhost:9999/journal-submit/clear.


Monitoring (ongoing)

The journal-submit-app@datadryad.org email account should be checked at least once daily.

Checking Email Notifications for Journal Submit Errors

  1. Log in to the journal-submit-app@datadryad.org account in Gmail to view the email notifications being processed.
  2. Check to make sure that the journal-submit folder does not have more than a few emails. If there are more than a few emails, go to the Troubleshooting section shown above.
  3. Check the journal-submit-error folder. If it contains emails, inspect each email to determine the possible reason for the error. Here are a few examples of common issues and how to resolve them:
    • Email is a congratulatory email sent as a “reply all” that includes the journal-submit email address - remove the journal-submit-error label
    • Email is intended for the journal and was sent using “reply-all” - if this does not apply to Dryad, remove the journal-submit-error label
    • Email contains a question that should be answered by Dryad - forward to the help desk
    • Email is an incomplete duplicate of a more detailed email - remove the journal-submit-error label
    • Email is in HTML or rich text and not in plain text - copy to new email and send to journal-submit email address
    • Email does not have one or more of the following six mandatory fields or the value for the field is left blank - contact the journal
      • Journal Code
      • Journal Name
      • MS Reference Number
      • MS Title
      • MS Authors
      • Article Status
  4. After inspecting the email notification and addressing any issues, it may be necessary to reprocess it by changing its labels. Multiple emails can be reprocessed at once, but reprocess no more than 100 at a time to ensure that the system works properly.
    1. Change the labels of any email to be reprocessed as follows
      1. If reprocessing a single email, while viewing the email, click on "Labels" to display the available labels
        OR
        If reprocessing multiple emails, from the journal-submit-error folder, click on the checkboxes to select the emails to reprocess and then click on "Labels" to display the available labels
      2. From the list of "Labels," remove the checkmark in front of journal-submit-error
      3. Add a checkmark in front of journal-submit
      4. Click on Apply
    2. Log in to the Dryad server to track the journal-submit.log to ensure the email processing webapp is processing emails by using the following command:
      tail -f journal-submit.log
    3. If emails do not seem to be processing correctly, see the Troubleshooting section shown above.
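When triaging the error folder, the check for the six mandatory fields listed above can be done mechanically. The following Python sketch is an illustration only: the function name is hypothetical, and it assumes one "Field: value" pair per line, as in the sample emails in this document.

```python
MANDATORY_FIELDS = (
    "Journal Code",
    "Journal Name",
    "MS Reference Number",
    "MS Title",
    "MS Authors",
    "Article Status",
)


def missing_mandatory_fields(body):
    """Return the mandatory fields that are absent or blank in an email body."""
    present = {}
    for line in body.splitlines():
        name, sep, value = line.partition(":")
        if sep:
            present[name.strip()] = value.strip()
    # A field counts as missing if it never appears or its value is empty.
    return [name for name in MANDATORY_FIELDS if not present.get(name)]
```

An empty result means the email has all six fields and the parse failure has another cause, such as HTML formatting.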

Additional implementation notes

Relation to DSpace

The journal-submit webapp is a separate DSpace module, but it is related to the standard DSpace submission process (which Dryad has heavily modified). It uses the same manuscript metadata classes as the Dryad REST API webapp.

Control flow

DryadEmailSubmission handles the retrieval of messages, based on a timer or when a retrieval is manually requested. When it receives a message, it processes it. Currently only MIME messages are processed by the webapp; HTML messages will be ignored.

The processor looks for content corresponding to the format specified in Journal Metadata. It uses the email's provided Journal Code or Journal Name to look for a matching journal concept in the authority control; if it finds one, it uses the specified parsingScheme to choose an email parser.

The parser returns a Manuscript object that contains the Dryad metadata. This Manuscript object can either be written out as an XML file or stored in the manuscript table of the PostgreSQL database.

Email parsers

The email parsers all derive from a basic parser that parses commonly-used fields found in the journal emails (see Journal Metadata for the proper formatting). Derived parser classes can modify the mapping of email fields to Dryad fields (by adding entries to fieldToXMLTagMap) or handle processing of journal-specific tags to better map email fields to Dryad fields (by overriding parseSpecificTags()).

The current derived classes handle specialized cases that differ from the base case:

  • EmailParserForManuscriptCentral - This is the most commonly used parser. Also handles fields specific to Genomic Resources Notes Technology (e.g. MS Citation Title, MS Citation Authors)
  • EmailParserForAmNat - AmNat has a large set of field tags that we don't process. This parser also uses variants in many of the field names and parses the corresponding author fields differently.
  • EmailParserForBmcEvoBio - This parser inherits from the ManuscriptCentral parser, with a few variant field names.
  • EmailParserForEcoApp - This parser includes a few variant field names and parses the corresponding author differently.
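The relationship between the base parser and its derived classes can be illustrated with a small Python analogue. The real parsers are Java classes; here `field_to_xml_tag` and `parse_specific_tags` mirror the `fieldToXMLTagMap` and `parseSpecificTags()` names mentioned above, while the class names, XML tag names, and the uppercase normalisation are invented for the example.

```python
class BaseEmailParser:
    # Hypothetical tag names; the real map lives in the Java codebase.
    field_to_xml_tag = {
        "Journal Name": "Journal",
        "MS Reference Number": "Manuscript",
        "MS Title": "Article_Title",
    }

    def parse(self, body):
        """Map recognised 'Field: value' lines to tags, then run the hook."""
        result = {}
        for line in body.splitlines():
            name, sep, value = line.partition(":")
            tag = self.field_to_xml_tag.get(name.strip())
            if sep and tag:
                result[tag] = value.strip()
        self.parse_specific_tags(result)
        return result

    def parse_specific_tags(self, result):
        """Hook for journal-specific handling; the base parser does nothing."""


class ExampleDerivedParser(BaseEmailParser):
    # Extend the mapping with an extra field, as a derived parser would.
    field_to_xml_tag = {
        **BaseEmailParser.field_to_xml_tag,
        "MS Citation Title": "Citation_Title",
    }

    def parse_specific_tags(self, result):
        # Illustrative journal-specific step: normalise the manuscript number.
        if "Manuscript" in result:
            result["Manuscript"] = result["Manuscript"].upper()
```

A derived parser therefore changes behaviour in two places only: the field map and the post-processing hook.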

Automating the Review Workflow

The "in review" workflow stage is a holding ground for data submissions associated with manuscripts in review. When integrated journals send metadata emails to Dryad with an article status of "submitted," any subsequent data packages created using that manuscript ID are submitted to the review workflow stage.

Options for Content in the Review Workflow

Actions can be taken on items matching manuscript metadata when Dryad is notified of a change in status. 

  • Accepted: matching packages will be pushed into the curation workflow stage.
  • Published: matching packages will be pushed into the curation workflow stage. (TO BE IMPLEMENTED: they should be automatically archived.)
  • Rejected: matching packages will be returned to the submitters' workspace.
  • Needs Revision: matching packages will be returned to the submitters' workspace.

Another way of looking at this: there are two possible actions that can be taken. Items can be "approved" from the review workspace, which moves them to the curation workflow, or they can be "rejected," which returns them to the submitter's workspace.
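The four statuses and two actions described above amount to a small lookup table, sketched here in Python. The enum and function names are hypothetical, and the exact status strings the webapp accepts (case, spacing) are an assumption.

```python
from enum import Enum


class ReviewAction(Enum):
    APPROVE = "move to the curation workflow"
    REJECT = "return to the submitter's workspace"


STATUS_ACTIONS = {
    "accepted": ReviewAction.APPROVE,
    "published": ReviewAction.APPROVE,
    "rejected": ReviewAction.REJECT,
    "needs revision": ReviewAction.REJECT,
}


def action_for_status(status):
    """Return the action for an article status, or None if no action applies."""
    return STATUS_ACTIONS.get(status.strip().lower())
```

A status of "submitted" maps to no action here, since it only routes later submissions into review rather than acting on packages already there.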

Finding Content in the Review Workflow

Items in the review workflow are identified by their workflow_id numbers. These may or may not correspond to information matching manuscript metadata that we receive from journals.

Dryad attempts to match every review package that corresponds to an article's metadata. The system examines all review packages for the given journal by the following criteria:

  • Dryad DOI: If this is provided, this is the easiest case. Even if no other metadata matches, review packages with the Dryad DOI will be considered a match.
  • Manuscript ID: If this is provided, all review packages for this journal that contain a matching dc.identifier.manuscriptNumber value will be considered matches.
  • Authors: Even if neither the Dryad DOI nor the manuscript ID is provided, Dryad will compare authors for all review packages for the given journal. If every author is matched for both the review package and the manuscript metadata, it is considered a match.

Note that all review packages for a journal that match any of these criteria will be affected. This is because submitters sometimes create multiple submissions that correspond to (different parts of) the same manuscript metadata.
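The three matching criteria can be expressed as a short Python sketch. This is an illustration of the rules as described, not the Dryad implementation: the `ReviewPackage` type, the function names, and the way author names are normalised before comparison are all assumptions.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class ReviewPackage:
    workflow_id: int
    journal: str
    doi: Optional[str] = None
    manuscript_number: Optional[str] = None
    authors: FrozenSet[str] = frozenset()


def _norm(name):
    """Lowercase and collapse whitespace (an assumed normalisation)."""
    return " ".join(name.lower().split())


def is_match(pkg, journal, doi=None, manuscript_number=None, authors=()):
    if pkg.journal != journal:
        return False  # only packages for the given journal are examined
    if doi and pkg.doi == doi:
        return True
    if manuscript_number and pkg.manuscript_number == manuscript_number:
        return True
    wanted = {_norm(a) for a in authors}
    # Every author must match on both sides, i.e. the sets are equal.
    return bool(wanted) and wanted == {_norm(a) for a in pkg.authors}


def find_matches(packages, **metadata):
    """Return the workflow ids of every package matching any criterion."""
    return [p.workflow_id for p in packages if is_match(p, **metadata)]
```

Because any one criterion is enough, a single status email can affect several packages for the same journal.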

Managing Content in the Review Workflow

The review workflow can be managed either manually or automatically.

Manually via the command line

./{dspace.dir}/bin/dspace review-item {-i workflow_id|-m manuscript_number} -a {true|false}

This command requires two parameters. The first parameter indicates the item, and can take one of two forms:

  • -i the id of the workflow item (the workflow_item_id, not the item_id)
  • -m the manuscript number associated with the item

The second parameter indicates the status of the item:

  • -a whether or not the item has been approved
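When scripting this command, the argument list can be assembled and validated before it is run. The Python sketch below only builds the list; the function name is hypothetical, and the path to the `dspace` launcher is passed in by the caller because it depends on the server's {dspace.dir}.

```python
def review_item_command(dspace_bin, approved, workflow_id=None, manuscript_number=None):
    """Build the argument list for 'dspace review-item'."""
    # Exactly one of the two item selectors must be given.
    if (workflow_id is None) == (manuscript_number is None):
        raise ValueError("give exactly one of workflow_id or manuscript_number")
    if workflow_id is not None:
        item = ["-i", str(workflow_id)]
    else:
        item = ["-m", manuscript_number]
    return [dspace_bin, "review-item", *item, "-a", "true" if approved else "false"]
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on the server where the launcher is installed.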

Automatically via the journal-submit webapp

Emails sent by journals to the journal-submit account with a status that corresponds to an action will be automatically matched to packages in the review workflow. Those matching packages will be pushed to curation or returned to submitter, as necessary.

Automatically via the REST API

See http://wiki.datadryad.org/index.php?title=Dryad_REST_API_Technology&section=23

See Also