Journal Metadata Processing Technology

From Dryad wiki
Revision as of 10:15, 19 September 2011 by Ryan Scherle

Overview

The Dryad journal-submit module allows the import of metadata that is mailed from integrated journals to Dryad. The email is parsed for article publication metadata, and that metadata is used to populate the form that the submitter sees when he or she comes to Dryad and enters an accepted manuscript number.

Workflow

# The journal sends the article metadata by email to NESCent and to the article author.
# The email is addressed to journal-submit@datadryad.org. GoDaddy is configured to forward this email to journal-submit@nescent.org.
# journal-submit@nescent.org is configured to route the email to dryad.journal.submit@gmail and to journal-submit-dev@nescent.org.
# NESCent sends a copy of the email to the dryad.journal.submit Gmail address and also pipes the content to a `curl` command that POSTs the email content to the journal-submit webapp (this runs on the NESCent mail server). Likewise, the email to journal-submit-dev is piped to a script (journalWebAppDev.sh), which pipes the content to a `curl` command that POSTs it to the journal-submit webapp running on dev.datadryad.org.
# The journal-submit webapp reads the byte stream containing the email content and detects the journal's name.
# The webapp then looks up the journal name in the journal-submit configuration file, DryadJournalSubmission.properties, to learn which `parsingScheme` to use. The value of this parameter is matched against the parsing classes in the journal-submit codebase.
# Once the correct parser has been determined, the webapp uses it to parse the remainder of the email, putting metadata values into a ParsingResult object.
# The ParsingResult class is then used to export the metadata into an XML file. The location of this file is determined by the `metadataDir` value from the DryadJournalSubmission.properties configuration file.
# Once this file is written (using the manuscript number as the name of the file), the journal-submit webapp is done, and control moves to the standard DSpace/Dryad submission process, which uses the written XML file to populate the submission form when the author comes to Dryad to complete the submission.
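
The lookup and file-naming steps of this workflow can be sketched roughly as follows. This is a minimal illustration, not the webapp's actual code: the class and method names (`WorkflowSketch`, `codeForJournal`, `metadataFile`), the matching of the detected name against the `fullname` property, the `.xml` file extension, and the manuscript number `MS-0001` are all assumptions made for the example. Only the property keys come from DryadJournalSubmission.properties.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Properties;

class WorkflowSketch {

    /** Loads journal settings from properties-formatted text. */
    static Properties load(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /** Returns the journal code whose fullname matches the detected name, or null. */
    static String codeForJournal(Properties props, String journalName) {
        String prefix = "journal.";
        String suffix = ".fullname";
        for (String key : props.stringPropertyNames()) {
            if (key.startsWith(prefix) && key.endsWith(suffix)
                    && props.getProperty(key).trim().equalsIgnoreCase(journalName.trim())) {
                return key.substring(prefix.length(), key.length() - suffix.length());
            }
        }
        return null;
    }

    /** Reads the parsingScheme configured for a journal code. */
    static String parsingScheme(Properties props, String code) {
        return props.getProperty("journal." + code + ".parsingScheme");
    }

    /** Builds the output path: metadataDir plus the manuscript number (".xml" is assumed). */
    static Path metadataFile(Properties props, String code, String manuscriptNumber) {
        String dir = props.getProperty("journal." + code + ".metadataDir");
        return Paths.get(dir, manuscriptNumber + ".xml");
    }

    public static void main(String[] args) {
        Properties props = load(String.join("\n",
                "journal.amNat.fullname = The American Naturalist",
                "journal.amNat.metadataDir = /opt/dryad/submission/journalMetadata/amNat",
                "journal.amNat.parsingScheme = amNat"));
        String code = codeForJournal(props, "The American Naturalist");
        System.out.println("parsingScheme: " + parsingScheme(props, code));
        // "MS-0001" is a made-up manuscript number used only for illustration.
        System.out.println("metadata file: " + metadataFile(props, code, "MS-0001"));
    }
}
```

A journal name with no matching entry yields `null` here; how the real webapp reports an unknown journal is not covered by this page.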

Configuration

Below is a sample configuration from the journal-submit webapp's configuration file, DryadJournalSubmission.properties. Each journal handled by the submission system needs an entry in this file, even if it is not an integrated journal (that is, even if it doesn't send Dryad the article metadata in an email).

# all the journals configured in this file (using their parsing scheme code)
journal.order=amNat, BJLS, bmcEvoBio, bmjOpen, ecoApp, ecoMono, ecology, evolution, EvolApp, ecoFrontiers, intCompBio, EvolBiol, heredity, jhered, jpaleo, mbe, MolEcol, MolEcolRes, mpe, paleobio, sysBio

# American Naturalist
journal.amNat.fullname = The American Naturalist
# directory in which the resulting XML metadata file is stored
journal.amNat.metadataDir = /opt/dryad/submission/journalMetadata/amNat
# the parsing scheme (used to match against parsing class)
journal.amNat.parsingScheme = amNat
# whether we can receive article metadata emails from the journal
journal.amNat.integrated=true
# who to notify when a submission is reviewed
journal.amNat.notifyOnReview=ryantestAmNatReview@scherle.org
# who to notify when a submission is archived
journal.amNat.notifyOnArchive=ryantestAmNat@scherle.org
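
Because every journal listed in `journal.order` needs its own entry, a small consistency check can catch a journal that was added to the list but never configured. The sketch below is illustrative only: the class `JournalConfigCheck` is hypothetical, and treating `fullname`, `metadataDir`, `parsingScheme`, and `integrated` as required (with the notification addresses optional) is an assumption, not something the webapp is known to enforce.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

class JournalConfigCheck {

    // Assumed set of per-journal keys that must be present.
    static final String[] REQUIRED = {"fullname", "metadataDir", "parsingScheme", "integrated"};

    /** Loads settings from properties-formatted text. */
    static Properties load(String text) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    /** Splits the comma-separated journal.order value into trimmed journal codes. */
    static List<String> journalCodes(Properties props) {
        List<String> codes = new ArrayList<>();
        for (String code : props.getProperty("journal.order", "").split(",")) {
            if (!code.trim().isEmpty()) {
                codes.add(code.trim());
            }
        }
        return codes;
    }

    /** Lists every required key that is absent for a journal named in journal.order. */
    static List<String> missingKeys(Properties props) {
        List<String> missing = new ArrayList<>();
        for (String code : journalCodes(props)) {
            for (String suffix : REQUIRED) {
                String key = "journal." + code + "." + suffix;
                if (props.getProperty(key) == null) {
                    missing.add(key);
                }
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // BJLS is listed in journal.order but has no entry of its own.
        Properties props = load(String.join("\n",
                "journal.order=amNat, BJLS",
                "journal.amNat.fullname = The American Naturalist",
                "journal.amNat.metadataDir = /opt/dryad/submission/journalMetadata/amNat",
                "journal.amNat.parsingScheme = amNat",
                "journal.amNat.integrated=true"));
        for (String key : missingKeys(props)) {
            System.out.println("missing: " + key);
        }
    }
}
```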

The location of the DryadJournalSubmission.properties file is also configurable via Maven profiles. The default value should be ${DRYAD_HOME}/config/DryadJournalSubmission.properties, but if a different location is desired, the following should be changed in the Maven profile used to build the project:

<!-- the location of the configuration file for journal-submit webapp -->
<default.submit.journal.config>/opt/dryad/config/DryadJournalSubmission.properties</default.submit.journal.config>

One reason to do this is to use a different set of notification email addresses, so that non-project staff don't receive emails from the development instance of Dryad.

Testing

The `curl` command-line tool can be used to test the journal-submit module on a development machine.

curl --data-binary @message.test http://localhost:9999/journal-submit

The `--data-binary` parameter tells `curl` to send the data in binary form, and the @ symbol tells it to read the data from a file. In the example, message.test is a file on the local file system; its path should be either absolute or relative to the directory from which the command is run. The module will output either the XML that it generates or a stack trace indicating the problem it found while parsing the submitted data.
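
For automated tests, the same kind of request can be issued from Java. The following is a rough sketch rather than part of the journal-submit codebase: `PostEmailSketch` and its methods are hypothetical, and the stub server started in `main` is only a local stand-in that reports how many bytes it received, not the real webapp. It relies on `java.net.http.HttpClient`, which requires Java 11 or later.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.InetSocketAddress;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

class PostEmailSketch {

    /** POSTs the raw bytes unchanged, as `curl --data-binary` does, and returns the response body. */
    static String post(URI endpoint, byte[] emailBytes) {
        HttpRequest request = HttpRequest.newBuilder(endpoint)
                .POST(HttpRequest.BodyPublishers.ofByteArray(emailBytes))
                .build();
        try {
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            return response.body();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while posting", e);
        }
    }

    /** Starts a local stand-in server on a free port; it only reports the byte count received. */
    static HttpServer startStubServer() {
        try {
            HttpServer server = HttpServer.create(new InetSocketAddress("localhost", 0), 0);
            server.createContext("/journal-submit", exchange -> {
                byte[] received = exchange.getRequestBody().readAllBytes();
                byte[] reply = ("received " + received.length + " bytes")
                        .getBytes(StandardCharsets.UTF_8);
                exchange.sendResponseHeaders(200, reply.length);
                try (OutputStream out = exchange.getResponseBody()) {
                    out.write(reply);
                }
            });
            server.start();
            return server;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        HttpServer server = startStubServer();
        try {
            URI endpoint = URI.create(
                    "http://localhost:" + server.getAddress().getPort() + "/journal-submit");
            // Placeholder text; a real test would read the bytes of a saved email file.
            byte[] email = "placeholder email text\r\n".getBytes(StandardCharsets.UTF_8);
            System.out.println(post(endpoint, email));
        } finally {
            server.stop(0);
        }
    }
}
```

To exercise a real development instance instead, point `post` at the URL used in the `curl` example and drop the stub server.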

A good place to start debugging: the journal-submit webapp was originally written to concatenate textual values into an XML document rather than using an XML-aware library. Recently an XML library was added to check that the concatenated string is well-formed XML. If problems appear in the future, this check is a good place to start looking, since it imposes restrictions that individual parsers may or may not have handled correctly. The check was added to the EmailParser class, an abstract class that all parsers should extend.
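
To see why such a check can reject the output of a parser that concatenates strings, consider a value containing a bare ampersand. The sketch below is an illustration under assumptions: this page does not say which XML library performs the check, so the JDK's built-in parser stands in for it, and the class `WellFormedSketch`, the element name `Article_Title`, and the sample title are all made up.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

class WellFormedSketch {

    /** Returns true if the string parses as well-formed XML. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors without printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Escapes the characters that break element content when concatenated as-is. */
    static String escape(String text) {
        return text.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Builds an element by string concatenation, optionally escaping the value. */
    static String element(String name, String value, boolean escapeValue) {
        return "<" + name + ">" + (escapeValue ? escape(value) : value) + "</" + name + ">";
    }

    public static void main(String[] args) {
        String title = "Birds & Bees"; // made-up value containing a bare ampersand
        String raw = element("Article_Title", title, false);
        String escaped = element("Article_Title", title, true);
        System.out.println(raw + " well-formed? " + isWellFormed(raw));
        System.out.println(escaped + " well-formed? " + isWellFormed(escaped));
    }
}
```

If a stack trace from the check points at a particular field, an unescaped `&` or `<` in that field's value is one plausible cause worth ruling out first.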

Relation to DSpace

The journal-submit webapp is a separate webapp (a DSpace module), but it is related to the standard DSpace submission process, which Dryad has heavily modified.

List of Email Parsers

The following is a list of the current set of email parsers.

  • EmailParserForAmNat - This parser and the one for ManuscriptCentral have a similar structure, but different email fields. AmNat has a larger set of field tags. The tags map directly to XML element names.
  • EmailParserForBmcEvoBio - This parser is similar to the ManuscriptCentral parser. It differs in that it ignores line breaks in the abstract field (line breaks may be used to separate sections in the abstract). It also accepts author lists separated by commas and joined by 'and', rather than joined by semicolons.
  • EmailParserForEcoApp - Restructured parser that attempts to better separate the parsing and XML generation stages. Email tags are mapped to Java classes (in the xml child package), which are subclasses of the XOM Element class. XOM is not used for the final output; after each element is constructed, it is serialized and appended to a field in the returned ParsingResult object.
  • EmailParserForManuscriptCentral - Similar to AmNat, but with a smaller set of tags.
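
The two author-list conventions and the abstract handling described for EmailParserForBmcEvoBio can be illustrated with a short sketch. None of this is taken from the parsers themselves: the class `AuthorListSketch`, its method names, the regular expressions, and the sample names are assumptions chosen to show the difference between the formats.

```java
import java.util.ArrayList;
import java.util.List;

class AuthorListSketch {

    /** Splits a list such as "Smith, John; Doe, Jane" on semicolons. */
    static List<String> splitOnSemicolons(String authors) {
        return splitOn(authors, "\\s*;\\s*");
    }

    /** Splits a list such as "John Smith, Jane Doe and Bob Roe" on commas and the word "and". */
    static List<String> splitOnCommasAndAnd(String authors) {
        return splitOn(authors, "\\s*,\\s*(?:and\\s+)?|\\s+and\\s+");
    }

    /** Replaces each line break (and surrounding whitespace) in an abstract with one space. */
    static String joinAbstractLines(String abstractText) {
        return abstractText.trim().replaceAll("\\s*\\R\\s*", " ");
    }

    private static List<String> splitOn(String authors, String regex) {
        List<String> names = new ArrayList<>();
        for (String name : authors.trim().split(regex)) {
            if (!name.isEmpty()) {
                names.add(name);
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // All names below are placeholders.
        System.out.println(splitOnSemicolons("Smith, John; Doe, Jane"));
        System.out.println(splitOnCommasAndAnd("John Smith, Jane Doe and Bob Roe"));
        System.out.println(joinAbstractLines("Background: first part.\nResults: second part."));
    }
}
```

Note that the comma-based splitter would break names written as "Surname, Given", which is one reason the two formats need separate handling.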