Automating Weekly Reports

From Dryad wiki

Overview

This document outlines the requirements and specifications for automating weekly integration and curation summary reports.

Example reports with formatting are marked up in this Google doc

General Guidelines

The calculations below refer to specific database fields or metadata. Where possible, read this data through the highest-level interface available. That could be DryadDataPackage, one of the journal statistics classes in /dspace/modules/api-stats/src/main/java/org/datadryad/journalstatistics/extractor, another Java utility class, SOLR, or the SQL database itself.

These are merely suggestions on how to approach this problem from Java with existing APIs and classes.

IntegrationReportData

Write a class that encapsulates the data that would be sent in an Integration Report email for a single journal:

// Sketch: fields only; constructors and accessors omitted
import java.util.Date;

class IntegrationReportData {
    String journalName;
    Date beginDate;
    Date endDate;
    String featuredDataDOI;          // optional
    Integer manuscriptsInReview;
    Integer manuscriptsAccepted;
    Integer dataPackagesInReview;
    Integer dataPackagesArchived;
}

IntegrationReportGenerator

Write a class that generates IntegrationReportData objects from real data in Dryad, given a journal name, begin date, and end date.

Details on where this data may come from. Do review existing code for overlaps:

  • journalName - input / user specified
  • beginDate - input / user specified
  • endDate - input / user specified
  • featuredDataDOI - optional, input / user specified - Dryad staff select a data package to feature every week. The package would not be chosen automatically, but given a DOI, the tool that generates the reports/emails could fetch its description, link, and citation
  • manuscriptsInReview - The number of manuscripts for which we've received metadata during the date window with a status of Review. (The date is when we received the notice, e.g. when the XML file was created on the filesystem, when the email was received, or when the API call was made.)
  • manuscriptsAccepted - The number of manuscripts for which we've received metadata during the date window with a status of Accepted
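
The two manuscript counts differ only in the status they look for, so one helper can serve both. The following is a minimal sketch, not existing Dryad code: ManuscriptNotice and ManuscriptCounter are hypothetical names, loading the notices is left out, and treating the window as beginDate inclusive and endDate exclusive is an assumption.

```java
import java.util.Date;
import java.util.List;

/** Hypothetical value object: one metadata notice received from a journal. */
class ManuscriptNotice {
    final String journalName;
    final String status;      // e.g. "Review" or "Accepted"
    final Date receivedDate;  // when the XML file, email, or API call arrived

    ManuscriptNotice(String journalName, String status, Date receivedDate) {
        this.journalName = journalName;
        this.status = status;
        this.receivedDate = receivedDate;
    }
}

/** Hypothetical helper; not an existing Dryad class. */
class ManuscriptCounter {
    /**
     * Counts notices for one journal with the given status whose received
     * date falls in [beginDate, endDate). The half-open window is an assumption.
     */
    static int count(List<ManuscriptNotice> notices, String journalName,
                     String status, Date beginDate, Date endDate) {
        int total = 0;
        for (ManuscriptNotice n : notices) {
            boolean inWindow = !n.receivedDate.before(beginDate)
                    && n.receivedDate.before(endDate);
            if (inWindow && n.journalName.equals(journalName)
                    && n.status.equalsIgnoreCase(status)) {
                total++;
            }
        }
        return total;
    }
}
```

With this shape, manuscriptsInReview and manuscriptsAccepted are the same call with "Review" and "Accepted" as the status argument.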

CurationReportData

Write a class that encapsulates the data that is sent in a Curator Weekly Report. This may share a common base class/interface with IntegrationReportData:

// Sketch: fields only; constructors and accessors omitted
import java.util.Date;
import java.util.List;

class CurationReportData {
    Date beginDate;
    Date endDate;
    Integer archivedIntegratedPackageCount;
    Integer enteredBlackoutIntegratedPackageCount;
    Integer archivedNonIntegratedPackageCount;
    Integer enteredBlackoutNonIntegratedPackageCount;
    Integer inReviewPackageCount;
    Integer exitedBlackoutPackageCount;

    List<String> newJournals;
    List<String> nonIntegratedJournalsWithPackagesEnteringBlackout;
    List<String> nonIntegratedJournalsWithAlreadyPublishedArticles;
    List<String> nonIntegratedJournalsWithPackagesExitingBlackout;
}

CurationReportGenerator

Write a class that generates CurationReportData objects from real data in Dryad, given a begin date and an end date.

Details on where this data may come from - Do review existing code for overlaps:

  • beginDate - input / user specified
  • endDate - input / user specified
  • archivedIntegratedPackageCount
  1. Start with a list of Data Packages that are in the archive.
  2. Filter the list to packages for integrated journals. Determine journal integration status through the pre-existing helper class rather than by loading DryadJournalSubmission.properties, since that configuration is moving to the database.
  3. Filter the list to the date range. Note that the dates are not readily available in database fields and need to be extracted from the dc.description.provenance metadata. DryadDataPackage has a method to get the submitted provenance, but it would be a good idea to add a method that returns the archived date from provenance. Use a regex, get fancy!
  4. Count the items in the list
  • enteredBlackoutIntegratedPackageCount
  1. Start with a list of Data Packages that are in publication blackout (see workflow state in database documentation).
  2. Filter the list to packages for integrated journals.
  3. Filter the list to the date range. See DryadDOIRegistrationHelper.java for an example of extracting the publication blackout entry. This may get complicated if the report should only return packages that are in publication blackout during the report dates. It may make sense to add a method to DryadDataPackage, e.g. statusAtDate(date), that returns the state of a data package (blackout, archive, curation, etc.) on a given date.
  4. Count the items in the list
  • archivedNonIntegratedPackageCount - same as archivedIntegratedPackageCount but for journals that are not integrated
  • enteredBlackoutNonIntegratedPackageCount - same as enteredBlackoutIntegratedPackageCount but for journals that are not integrated. If you copy and paste the code 4 times to produce these 4 counts, you're doing it wrong.
  • inReviewPackageCount
  1. Start with a list of Data Packages that are in review (see workflow documentation).
  2. Filter the list to the date range. Again, if a package entered review it will have a dc.description.provenance field that mentions the review step and the date.
  3. Count the items in the list
  • exitedBlackoutPackageCount
  1. Start with a list of Data Packages that entered the archive after being in publication blackout (see workflow documentation)
  2. Filter the list to the date range. This should only include items that were once in blackout but were archived during the date range
  3. Count the items in the list
  • newJournals - There is no pre-determined method. One approach is to count packages per journal across all time and within the date range, then select the journals where the two counts are equal. For example, if Journal ABC has 2 data packages in total and 2 data packages in the date range, all of ABC's packages are new this week, so ABC should be in this list
  • nonIntegratedJournalsWithPackagesEnteringBlackout - Extract journal names from the list used in enteredBlackoutNonIntegratedPackageCount
  • nonIntegratedJournalsWithPackagesExitingBlackout - follow the logic in exitedBlackoutPackageCount, filtering for non-integrated journals, and extract the unique journal names.
  • nonIntegratedJournalsWithAlreadyPublishedArticles
  1. Start with a list of data packages in the archive
  2. Filter the list to packages for non-integrated journals
  3. Filter the list to packages archived within the date range
  4. Extract the journal names and de-duplicate the list.
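
The archived and blackout counts above share the same filter steps (integration status, then date range from provenance), so they can be driven by one parameterized routine. The sketch below is one possible shape, assuming Java 8 or later. PackageInfo and PackageSelector are hypothetical names, the set of integrated journal names stands in for the pre-existing helper class, and the provenance wording matched by the ARCHIVED pattern is an assumption that should be checked against real dc.description.provenance values. A blackout pattern would be passed in the same way.

```java
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical stand-in for the parts of a data package the report needs. */
class PackageInfo {
    final String journalName;
    final List<String> provenance; // dc.description.provenance values

    PackageInfo(String journalName, List<String> provenance) {
        this.journalName = journalName;
        this.provenance = provenance;
    }
}

/** Hypothetical helper; not an existing Dryad class. */
class PackageSelector {
    // ASSUMED wording of the archive provenance entry; verify against real records.
    static final Pattern ARCHIVED = Pattern.compile(
            "Approved for entry into archive by .*? on "
            + "(\\d{4}-\\d{2}-\\d{2}T\\d{2}:\\d{2}:\\d{2}Z)");

    /** Returns the first timestamp the pattern captures in any provenance value. */
    static Optional<Instant> eventDate(PackageInfo pkg, Pattern event) {
        for (String value : pkg.provenance) {
            Matcher m = event.matcher(value);
            if (m.find()) {
                return Optional.of(Instant.parse(m.group(1)));
            }
        }
        return Optional.empty();
    }

    /**
     * Shared filter for the archived and blackout counts: keeps packages whose
     * journal integration status matches wantIntegrated and whose event date
     * falls in [begin, end). The half-open window is an assumption.
     */
    static List<PackageInfo> select(List<PackageInfo> packages,
                                    Set<String> integratedJournals,
                                    boolean wantIntegrated,
                                    Pattern event, Instant begin, Instant end) {
        List<PackageInfo> kept = new ArrayList<>();
        for (PackageInfo pkg : packages) {
            if (integratedJournals.contains(pkg.journalName) != wantIntegrated) {
                continue;
            }
            Optional<Instant> date = eventDate(pkg, event);
            if (date.isPresent() && !date.get().isBefore(begin)
                    && date.get().isBefore(end)) {
                kept.add(pkg);
            }
        }
        return kept;
    }

    /** De-duplicated, sorted journal names for the journal-list fields. */
    static Set<String> journalNames(List<PackageInfo> packages) {
        Set<String> names = new TreeSet<>();
        for (PackageInfo pkg : packages) {
            names.add(pkg.journalName);
        }
        return names;
    }
}
```

Each count is then select(...).size() with a different starting list, flag, and pattern, and the journal-name lists come from journalNames(...) applied to the same selected list.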

ReportFormatter

Write classes that format IntegrationReportData and CurationReportData objects into a plaintext or HTML email body. The DSpace Email class, which uses templates, may complicate this or make it unnecessary, but it's important to be able to generate these reports without emailing them.
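
To illustrate generating a report without emailing it, here is a minimal plaintext formatter sketch that returns a String. PlainTextIntegrationReportFormatter is a hypothetical name, the report wording is a placeholder rather than the layout from the example reports, and formatting dates in UTC is an assumption. The data class repeats a subset of the IntegrationReportData fields only so the sketch compiles on its own.

```java
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.TimeZone;

/** Subset of the IntegrationReportData fields, repeated so this compiles alone. */
class IntegrationReportData {
    String journalName;
    Date beginDate;
    Date endDate;
    Integer manuscriptsInReview;
    Integer manuscriptsAccepted;
    Integer dataPackagesInReview;
    Integer dataPackagesArchived;
}

/** Hypothetical formatter; the report wording here is a placeholder. */
class PlainTextIntegrationReportFormatter {
    /** Builds the report body as a String; nothing is emailed. */
    static String format(IntegrationReportData data) {
        SimpleDateFormat day = new SimpleDateFormat("yyyy-MM-dd");
        day.setTimeZone(TimeZone.getTimeZone("UTC")); // assumed time zone
        StringBuilder out = new StringBuilder();
        out.append("Integration report for ").append(data.journalName).append('\n');
        out.append("Period: ").append(day.format(data.beginDate))
           .append(" to ").append(day.format(data.endDate)).append('\n');
        out.append("Manuscripts in review: ").append(data.manuscriptsInReview).append('\n');
        out.append("Manuscripts accepted: ").append(data.manuscriptsAccepted).append('\n');
        out.append("Data packages in review: ").append(data.dataPackagesInReview).append('\n');
        out.append("Data packages archived: ").append(data.dataPackagesArchived).append('\n');
        return out.toString();
    }
}
```

Because the formatter only returns text, the same output can be printed to the console, written to a file, or handed to whatever class does the mailing.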

ReportEmailer

Write a class that accepts a List of email addresses and a report (formatted, or with a formatter) and sends the formatted report to each address.
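
One way to keep the emailer testable during development is to put the actual mail mechanism behind a small interface. The sketch below assumes Java 8 or later. MailTransport is a hypothetical seam, not part of DSpace; a real implementation might wrap the DSpace Email class, while a test implementation can simply record what would have been sent.

```java
import java.util.List;

/** Hypothetical seam for the real mail mechanism; not a DSpace interface. */
interface MailTransport {
    void send(String toAddress, String subject, String body);
}

/** Sketch of a ReportEmailer that delegates delivery to a MailTransport. */
class ReportEmailer {
    private final MailTransport transport;

    ReportEmailer(MailTransport transport) {
        this.transport = transport;
    }

    /**
     * Sends the same formatted report to every non-blank address.
     * Returns the number of messages handed to the transport.
     */
    int send(List<String> addresses, String subject, String formattedReport) {
        int sent = 0;
        for (String address : addresses) {
            if (address == null || address.trim().isEmpty()) {
                continue; // skip empty entries rather than fail the whole run
            }
            transport.send(address.trim(), subject, formattedReport);
            sent++;
        }
        return sent;
    }
}
```

Passing a recording transport during development also guards against mailing live journal personnel by accident.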

Command-line interfaces

Write two classes with command-line interfaces (main methods) that do the following steps. In both cases the des

Integration Reports

  1. Accepts a beginDate and endDate as parameters
  2. Fetches the list of integrated journals in Dryad
  3. For each journal:
       1. Fetches the list of notifyWeekly addresses (during development this should not be live journal personnel)
       2. Generates an IntegrationReportData using IntegrationReportGenerator
       3. Formats and emails the report to the notifyWeekly addresses

Curator Weekly Reports

  1. Accepts a beginDate and endDate as parameters
  2. Generates a CurationReportData using CurationReportGenerator
  3. Formats and emails the report to dryad-dev@googlegroups.com (the address should actually be configured in the Maven profile)
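
Both command-line interfaces begin by reading a beginDate and endDate, so that step can be shared. The following is a minimal sketch, assuming Java 8 or later and dates passed as two positional arguments in yyyy-MM-dd form; the class name ReportDateRange and the argument format are assumptions, not part of the specification.

```java
import java.time.LocalDate;
import java.time.format.DateTimeParseException;

/** Hypothetical holder for the date window shared by both report commands. */
class ReportDateRange {
    final LocalDate beginDate;
    final LocalDate endDate;

    ReportDateRange(LocalDate beginDate, LocalDate endDate) {
        this.beginDate = beginDate;
        this.endDate = endDate;
    }

    /**
     * Parses {beginDate, endDate} from command-line arguments.
     * Throws IllegalArgumentException with a usage hint on bad input.
     */
    static ReportDateRange parse(String[] args) {
        if (args.length != 2) {
            throw new IllegalArgumentException(
                    "usage: <beginDate> <endDate> (yyyy-MM-dd)");
        }
        try {
            LocalDate begin = LocalDate.parse(args[0]); // ISO-8601 date
            LocalDate end = LocalDate.parse(args[1]);
            if (end.isBefore(begin)) {
                throw new IllegalArgumentException(
                        "endDate must not be before beginDate");
            }
            return new ReportDateRange(begin, end);
        } catch (DateTimeParseException e) {
            throw new IllegalArgumentException(
                    "dates must be in yyyy-MM-dd form: " + e.getParsedString(), e);
        }
    }
}
```

Each main method would call ReportDateRange.parse(args) first and then hand the two dates to its report generator.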