Statistics Technology

Dryad's statistics display reflects the distinction between data packages and data files. A data package contains all the data files (and other related files) associated with a particular publication, so each package may be related to multiple data files. Dryad displays statistical information at both levels: the data package and its data files.

Functionality

On the home page, we display overall statistics.

We also display a "viewed" and "downloaded" count for each package and file (on both the Dryad Data Package and Dryad Data File item view pages).

Workflow

DSpace stores its usage statistics in a Solr index called "statistics" (it has a Web-based administrative interface through which test queries can be run). Records stored in this index have an `owningColl` (owning collection) and `id`. Records for bitstreams will have an `owningItem` (owning item). Dryad doesn't attach bitstreams directly to data packages, so when looking for download statistics, search by owningItem using the internal ID of the associated data file object.

Relationships between data files and data packages are not stored in the statistics index, since this concept is overlaid onto DSpace by Dryad. They can, however, be derived from the item records in the "search" Solr index, where general searching takes place and where the relationships between records are stored as linked identifiers. The "search" index also has a Web-based administrative interface through which test searches can be performed.

Since Dryad stores data packages and data files in separate collections, overall statistics (such as those displayed on the Dryad home page) can be gathered by checking the total number of active records in each collection. The number of unique journals with publications represented in Dryad is determined by querying the distinct journal names in the prism.publicationName field of the "search" Solr index.
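As a rough illustration of the unique-journal count, the sketch below counts distinct values in a Solr facet response. It assumes the "search" index is queried with standard Solr faceting parameters (for example `q=*:*&rows=0&facet=true&facet.field=prism.publicationName&facet.limit=-1&wt=json`) and that the field is indexed so that each facet value is a whole journal name. The function name and the sample response are hypothetical, and this is not necessarily how SiteOverview does it.

```python
import json


def count_unique_journals(solr_response, field="prism.publicationName"):
    """Count distinct facet values that have at least one matching record."""
    flat = solr_response["facet_counts"]["facet_fields"][field]
    # Solr's default JSON facet layout is a flat list:
    # [value, count, value, count, ...]
    names, counts = flat[0::2], flat[1::2]
    return sum(1 for name, count in zip(names, counts) if name and count > 0)


# Hypothetical response body, trimmed to the part used here.
sample = json.loads("""
{"facet_counts": {"facet_fields": {"prism.publicationName":
    ["Evolution", 120, "Molecular Ecology", 95, "Retired Journal", 0]}}}
""")
print(count_unique_journals(sample))  # 2
```

Values with a count of zero are skipped, so journals that no longer have any active records do not inflate the total.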

All of these statistical functions are performed in Java code, which then writes the resulting values into the `pageMeta` (page metadata). The Dryad XSLT theme takes these values and displays them on the appropriate page when that page is rendered into HTML.

The Java classes that gather these Dryad-specific statistics are kept in the Dryad overlay of the xmlui module. They have been put into the org.datadryad.dspace.statistics package to indicate that they are specific to Dryad and not just slight modifications of existing DSpace functions. Here are the classes used and a brief explanation of each:

org.datadryad.dspace.statistics.SiteOverview

  • Generates the overall statistical summary displayed on the home page

org.datadryad.dspace.statistics.ItemStatsOverview

  • Pulls together the stats generated by ItemPkgStats and ItemFileStats and puts them into the page metadata
  • Caches the statistical generation so it doesn't need to be run at each page visit

org.datadryad.dspace.statistics.ItemPkgStats

  • Generates statistics for individual data packages

org.datadryad.dspace.statistics.ItemFileStats

  • Generates statistics for individual data files

For more details, consult the Java classes in the org.datadryad.dspace.statistics package directly.

The XSLT code that pulls the individual data package and data file statistics from the page metadata for display on the page can be found in the DryadItemSummary.xsl file. The XSLT code that displays the site's overall statistical information can be found in the default Dryad.xsl file.

Configuration

Only a minimal amount of configuration is needed in the dspace.cfg file for the Dryad statistics to work. They rely on the locations of the Solr servers being set via the `solr.log.server` and `solr.search.server` variables. The Dryad statistics code also requires two additional variables, `stats.datafiles.coll` and `stats.datapkgs.coll`, which indicate the collections that contain the data files and data packages.

# The handle for the data package collection's stats (also used by DOI minter)
stats.datapkgs.coll = 10255/3
# The handle for the data file collection's stats (also used by DOI minter)
stats.datafiles.coll = 10255/2

Dryad sets the Solr server variables in the dspace.cfg file rather than the dspace-solr-search.cfg file so that we can use the standard Maven profiles mechanism to override these variables, depending on which Dryad instance we're running (demo, dev, production, etc.).

Statistics on the Statistics

All statistics are stored in Solr.

An easy way to process stats (outside of DSpace) is to construct a Solr query, curl it into a file, and then process the file to get the needed information.

File/Package views can be found by searching for field "id" with the item's internal DSpace item_id. File downloads can be found by searching for field "owningItem" with the item's internal DSpace item_id.
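The view/download distinction above can be sketched as a small URL builder. This is a minimal sketch, not part of the Dryad code base: the function name, the `kind` parameter, and the small default `rows` value are assumptions made for illustration, and DRYAD_SERVER is the same placeholder host used in the sample requests.

```python
from urllib.parse import urlencode

# Placeholder host; substitute the real server name before use.
SOLR_STATISTICS_URL = "http://DRYAD_SERVER/solr/statistics/select/"


def build_stats_url(item_id, kind, rows=10, exclude_bots=True):
    """Return a statistics-index query URL for one item.

    kind="views"     -> match the item's internal ID on the `id` field
    kind="downloads" -> match the item's internal ID on the `owningItem` field
    """
    field = {"views": "id", "downloads": "owningItem"}[kind]
    query = f"{field}:{int(item_id)}"
    if exclude_bots:
        # isBot is the flag maintained by the spider-marking tools.
        query = f"isBot:false AND {query}"
    # Only timestamps are requested, which keeps the response small.
    params = {"q": query, "fl": "time", "rows": rows}
    return SOLR_STATISTICS_URL + "?" + urlencode(params)


print(build_stats_url(42884, "downloads"))
print(build_stats_url(42884, "views", exclude_bots=False))
```

The default of 10 rows is deliberately small so that a pasted URL does not return a huge result set; raise it only when the full set of records is actually needed.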

Sample requests

The statistics are based on Dryad's internal item_id for each item, not the DOI.

Downloads of a particular data file (not filtering out web spiders):

http://DRYAD_SERVER/solr/statistics/select/?indent=on&q=owningItem:42884&rows=10000000

WARNING: if you attempt to follow these links in a web browser, they will return very large results. To prevent accidental usage, we have replaced datadryad.org with DRYAD_SERVER.

All downloads of data files from a particular year (year = 2017, owning item = anything, only return timestamps):

http://DRYAD_SERVER/solr/statistics/select/?q=owningItem:%5B*%20TO%20*%5D+time:2017-*&fl=time&rows=10000000

All views of data package pages, excluding bots (owning collection = 2, no owning item present, only return timestamps):

http://DRYAD_SERVER/solr/statistics/select/?q=isBot:false+owningColl:2%20-owningItem:%5B""%20TO%20*%5D&fl=time&rows=100000000

To bin these by month, curl the URL into a file, grep for the month (e.g., "2011-10"), and pipe to wc -l.
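The same binning can be done for all months at once with a short script. The sketch below assumes the response was saved to a file and that each record carries one ISO 8601 timestamp in its `time` field, as when `fl=time` is used; the function name and the sample excerpt are hypothetical.

```python
import re
from collections import Counter

# Captures the year-month prefix of timestamps such as 2011-10-03T14:22:05Z
TIMESTAMP_MONTH = re.compile(r"(\d{4}-\d{2})-\d{2}T\d{2}:\d{2}:\d{2}")


def count_by_month(response_text):
    """Return a Counter mapping 'YYYY-MM' to the number of timestamps found."""
    return Counter(TIMESTAMP_MONTH.findall(response_text))


# Hypothetical excerpt of a saved response.
sample = """
<doc><date name="time">2011-10-03T14:22:05.123Z</date></doc>
<doc><date name="time">2011-10-17T09:01:44.000Z</date></doc>
<doc><date name="time">2011-11-02T23:59:59.999Z</date></doc>
"""
for month, hits in sorted(count_by_month(sample).items()):
    print(month, hits)
```

Because the pattern matches timestamps rather than lines, the counts do not depend on how the response happens to be wrapped. For a saved file, pass its contents to `count_by_month`.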

Managing Statistics

Statistics can be managed with the StatisticsClient command-line tool:

/opt/dryad/bin/dspace dsrun org.dspace.statistics.util.StatisticsClient

The StatisticsClient has options for managing the entries in the statistics index, including:

 -u,--update-spider-files         Update Spider IP Files from internet into /opt/dryad/config/spiders
 -f,--delete-spiders-by-flag      Delete Spiders in Solr By isBot Flag
 -i,--delete-spiders-by-ip        Delete Spiders in Solr By IP Address
 -r,--delete-review-identifiers   Delete ip/dns/token from downloads of in-review items
 -m,--mark-spiders                Update isBot Flag in Solr
 -h,--help                        help

Relation to DSpace

Dryad's statistical display relies on the statistics log index created by the standard DSpace Statistics module. It also relies on the "search" index created by the Discovery module. If Solr index fields change in the Discovery module, the custom Dryad statistical functionality may break (for instance, if the publication names are put into a different index or if the relationships between files and packages are stored differently).

See Also