Journal Metadata Processing Technology

From Dryad wiki
Revision as of 20:36, 28 October 2015 by Ryan Scherle (talk | contribs) (Gmail-based Workflow)


The Dryad journal-submit module allows the import of metadata that is mailed from integrated journals to Dryad. The email is parsed for article publication metadata, and that metadata is used to populate the form that the submitter sees when he or she comes to Dryad and inputs an accepted manuscript number.

Gmail-based Workflow

  1. Journal sends email with article metadata to the email address, which forwards to
  2. Messages in are automatically labeled by Gmail with a special label.
  3. At a regular interval, the journal-submit webapp retrieves the newest emails with the special label, saves those messages for further processing, and removes the label. The specific label is set in the Maven settings on the server, so each Dryad server can react to a different label.
  4. The webapp processes the new emails: it gets the byte stream with the email content and detects the journal's name.
  5. The webapp then looks up the journal name in the journal authority control to learn which `parsingScheme` to use. The value of this attribute is matched against the parser classes in the journal-submit codebase.
  6. Once the correct parser is identified, the webapp uses it to parse the remainder of the email, putting metadata values into a ParsingResult object.
  7. The ParsingResult class is then used to export the metadata into an XML file; the location of this file is determined by the `metadataDir` value from the journal authority control.
  8. Once this file is written (using the manuscript number as the name of the file), the journal-submit webapp is done with its process and control moves to the standard DSpace/Dryad submission process, which uses the written XML file to populate the submission form when the author comes to Dryad to complete the submission process.
  9. Later, a submitter will use the Submission System to retrieve the parsed metadata and initiate a new data submission.
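
Steps 4 through 8 above can be sketched in a few lines of Python. This is only an illustration of the flow, not the webapp's Java code: the `Journal Name` and `MS Reference Number` tags, the `submission` root element, the `PARSERS` registry, and the toy parser are all assumptions made for the example.

```python
from pathlib import Path
from xml.etree import ElementTree as ET


def parse_tagged_lines(body):
    """Toy stand-in for a parser class: read 'Tag: value' lines into a dict."""
    fields = {}
    for line in body.splitlines():
        tag, sep, value = line.partition(":")
        if sep and tag.strip():
            fields[tag.strip()] = value.strip()
    return fields


# Hypothetical registry: parsingScheme value -> parser function.
PARSERS = {"manuscriptCentral": parse_tagged_lines}


def process_email(body, journals):
    """Detect the journal, pick a parser, parse, and write <manuscript>.xml."""
    # Step 4: detect the journal's name (assumed to be on a 'Journal Name:' line).
    journal_name = parse_tagged_lines(body)["Journal Name"]
    # Step 5: look up the journal's parsingScheme and metadataDir.
    journal = journals[journal_name]
    parser = PARSERS[journal["parsingScheme"]]
    # Step 6: parse the email into a result (a plain dict in this sketch).
    result = parser(body)
    # Step 7: export the metadata as XML, one element per tag.
    root = ET.Element("submission")
    for tag, value in result.items():
        ET.SubElement(root, tag.replace(" ", "_")).text = value
    # Step 8: the manuscript number is used as the file name.
    out = Path(journal["metadataDir"]) / (result["MS Reference Number"] + ".xml")
    out.write_bytes(ET.tostring(root, encoding="utf-8"))
    return out
```

Calling `process_email(body, journals)` with a `journals` dictionary that maps a journal name to its `parsingScheme` and `metadataDir` returns the path of the XML file it wrote.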



There are two components to the workflow: the webapp, which runs on the server, and the Gmail account. Both can be tweaked separately.

Configuring the server webapp

Several settings need to be in the server's Maven settings.xml file. The correct values for the production server can be found on at /home/dhdryad/gmail.settings.xml, but for other servers and testing, you can modify these values as needed. For example, can have its own gmail label, dev-journal-submit.

<default.submit.journal.clientsecret>{JSON GOES HERE}</default.submit.journal.clientsecret>

Authorizing the webapp

This should only need to be done when a server is deploying the webapp for the very first time: the credentials should remain authorized unless and until someone revokes the access through the Google Developer Console.

In order to access the Gmail account, the clientsecret JSON data must be present in the Maven settings.xml file. OAuth requires a clientsecret for the credential exchange, associated with the Gmail account. The JSON can be obtained at and is the downloaded JSON for "Client ID for native application."

Once the webapp is configured and built, start the Tomcat instance and authorize the webapp to access the Gmail account:

Go to a web browser and navigate to http://localhost:9999/journal-submit/authorize (or whatever address the Tomcat server is running at) and follow the OAuth2 instructions.

When you are provided with an auth code, copy it and make a call to http://localhost:9999/journal-submit/authorize?code=whateverthecodeis to authorize the webapp.
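
Auth codes may contain characters such as `/` that need escaping when placed in a query string. The following Python sketch builds the two URLs used above; `authorize_url` is a hypothetical helper, and the base address is the localhost example from this page, so substitute your own Tomcat address.

```python
from urllib.parse import urlencode

# Example address from this page; adjust to wherever Tomcat is running.
BASE = "http://localhost:9999/journal-submit/authorize"


def authorize_url(auth_code=None, base=BASE):
    """Return the URL that starts authorization, or the one that submits a code."""
    if auth_code is None:
        return base
    # urlencode escapes characters that are not safe in a query string.
    return base + "?" + urlencode({"code": auth_code})
```

The resulting URL can be pasted into a browser or passed to `curl`.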

Configuring the journals

Below is a sample configuration from the journal authority control.

journal.journalID appPlantSci
journal.fullname Applications in Plant Sciences
journal.metadataDir /opt/dryad/submission/journalMetadata/appPlantSci
journal.parsingScheme manuscriptCentral
journal.integrated false
journal.allowReviewWorkflow false
journal.embargoAllowed true
journal.publicationBlackout true
journal.subscriptionPaid true
journal.sponsorName The Botanical Society of America

It is important to use dummy email addresses on the non-production servers to avoid having journal staff receive emails for items that are being tested.
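
For scripting or sanity-checking, an entry like the sample above can be read into a dictionary. This Python sketch assumes each line is a key followed by whitespace and a value, as displayed on this page; the actual file syntax may differ, and `parse_journal_entry` is a hypothetical helper.

```python
SAMPLE = """\
journal.journalID appPlantSci
journal.fullname Applications in Plant Sciences
journal.metadataDir /opt/dryad/submission/journalMetadata/appPlantSci
journal.parsingScheme manuscriptCentral
journal.integrated false
journal.sponsorName The Botanical Society of America
"""


def parse_journal_entry(text):
    """Parse 'journal.key value' lines into a dict, converting true/false."""
    entry = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first space only, since values may contain spaces.
        key, _, value = line.partition(" ")
        if key.startswith("journal."):
            key = key[len("journal."):]
        value = value.strip()
        if value in ("true", "false"):
            value = (value == "true")
        entry[key] = value
    return entry
```

For example, `parse_journal_entry(SAMPLE)["parsingScheme"]` gives the scheme name that selects the parser.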

Configuring the Gmail labels

The labels that the webapp looks for need to be configured in the Gmail settings. Log into the Gmail web site as the dryad.journal.submit account. Create a label for the server you're configuring, matching the <default.submit.journal.label> setting in the Maven settings file. If you want to listen for all incoming emails, set a filter to label all incoming messages with the label. The webapp will remove that label as it processes the labeled emails. Do not use an existing label for a new server instance, or else the new server's webapp will remove some other server's labels!

For testing on use the label "test-journal-submit".


Gmail testing

After authorization has been completed as above, test the Gmail connection: `curl http://localhost:9999/journal-submit/test` should return some test messages and write some snippets to the journal-submit.log file.

The processing can also be tested by using extra test labels. In your server's Maven settings.xml, change the `<default.submit.journal.label>` value to something like "journal-submit-test". Then, in Gmail, log into the account specified by the `<>` value. Apply that same test label to a properly formed message. If you then execute `curl http://localhost:9999/journal-submit/retrieve`, you should see the correct values logged in the journal-submit.log file, and the corresponding XML file will be created as well.


If running http://localhost:9999/journal-submit/test returns errors, it is possible that the stored credential has gotten out of sync. Delete the credential file on the server (stored at /opt/dryad/submission/credential/StoredCredential) and reauthorize the webapp.

Relation to DSpace

The journal-submit webapp is a separate DSpace module, but it is related to the standard DSpace submission process (which Dryad has heavily modified).

Email Parsers

The following is a list of the current set of email parsers.

  • EmailParserForAmNat - This parser and the one for ManuscriptCentral have a similar structure, but different email fields. AmNat has a larger set of field tags. The tags map directly to XML element names.
    • Authors' names are: first last, degree/title (e.g. Dr., Prof.). Names are separated by semicolons.
    • Classification terms are: Major: minor. Terms are separated by semicolons.
  • EmailParserForBmcEvoBio - This parser is similar to the ManuscriptCentral parser. It differs in that it ignores line breaks in the abstract field (line breaks may be used to separate sections in the abstract). It also accepts author lists separated by commas and joined by 'and', rather than separated by semicolons.
    • Authors are first last. Names are separated by commas, final author joined by 'and' (no comma).
    • Keywords (Classification terms) are: Major: minor. Terms are separated by line breaks.
  • EmailParserForEcoApp - Restructured parser that attempts to better separate the parsing and XML generation stages. Email tags are mapped to java classes (in the xml child package), which are subclasses of the xom Element class. XOM is not used for the final output - after each element is constructed, it is serialized and appended into a field in the ParsingResult returned object.
    • Authors' names are: first last. Names are separated by commas, final author joined by 'and' (with preceding comma).
    • Parsing of Classification terms is currently unknown (they do not appear in the example messages).
  • EmailParserForManuscriptCentral - Similar to AmNat, but a smaller set of tags.
    • Authors' names are: last, first. Names are separated by semicolons.
    • Keywords (Classification terms) are comma-separated.
    • Also handles fields specific to Genomic Resources Notes Technology (e.g. MS Citation Title, MS Citation Authors)
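
The author formats listed above can be illustrated with a short Python sketch. These functions are hypothetical and are not the parsers' actual Java code; they only show how each described format could be split into a list of "first last" names.

```python
import re


def split_on_semicolons(field):
    """Split a semicolon-separated field, dropping empty pieces."""
    return [part.strip() for part in field.split(";") if part.strip()]


def amnat_authors(field):
    """AmNat style: 'first last, title; first last, title'."""
    # Keep only the text before the first comma, dropping the degree/title.
    return [a.split(",", 1)[0].strip() for a in split_on_semicolons(field)]


def manuscript_central_authors(field):
    """ManuscriptCentral style: 'last, first; last, first'."""
    names = []
    for author in split_on_semicolons(field):
        last, _, first = author.partition(",")
        names.append((first.strip() + " " + last.strip()).strip())
    return names


def comma_and_authors(field):
    """BmcEvoBio / EcoApp style: 'A, B and C' or 'A, B, and C'."""
    # Try ', and' first so the serial-comma form does not leave an empty piece.
    parts = re.split(r",\s*and\s+|\s+and\s+|,", field)
    return [p.strip() for p in parts if p.strip()]
```

A single function covers both the BmcEvoBio and EcoApp forms because the only difference described is whether a comma precedes the final 'and'.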

Managing Content in the Review Workflow

The "in review" workflow stage is a holding ground for data submissions associated with manuscripts in review.

To approve or reject items that are in the review stage, the following command can be used:

./{dspace.dir}/bin/dspace review-item {-i workflow_id|-m manuscript_number} -a {true|false}

This command requires two parameters. The first parameter identifies the item, and can take one of two forms:

  • -i the id of the workflow item (the workflow_item_id, not the item_id)
  • -m the manuscript number associated with the item

The second parameter indicates the status of the item:

  • -a whether or not the item has been approved
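
When approvals are scripted, it can help to build the argument list in one place so that exactly one of `-i` or `-m` is always supplied. The Python sketch below only assembles the arguments shown above; `review_item_args` is a hypothetical helper, and the DSpace directory passed to it is whatever `{dspace.dir}` is on your server.

```python
def review_item_args(dspace_dir, approved, workflow_id=None, manuscript_number=None):
    """Build the argument list for the review-item command."""
    # Exactly one of the two item selectors must be given.
    if (workflow_id is None) == (manuscript_number is None):
        raise ValueError("give exactly one of workflow_id or manuscript_number")
    if workflow_id is not None:
        selector = ["-i", str(workflow_id)]
    else:
        selector = ["-m", str(manuscript_number)]
    return [
        dspace_dir.rstrip("/") + "/bin/dspace",
        "review-item",
        *selector,
        "-a",
        "true" if approved else "false",
    ]
```

The returned list can be passed to `subprocess.run` on the server; building a list rather than a shell string avoids quoting problems with unusual manuscript numbers.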
