Statistics Technology

From Dryad wiki
Revision as of 17:16, 26 January 2012 by Ryan Scherle (talk | contribs) (Configuration)

Dryad provides a statistical display that reflects its distinction between data packages and data files. A data package contains all the data files (and other related files) associated with a particular publication, so each data package may be related to multiple data files. Dryad displays statistical information at both levels: the data package and the data files.

Functionality

On the home page, we want to display an overall message about the number of data packages and data files in the form of: "As of Apr 25, 2011, Dryad contains 596 data packages and 1445 data files, associated with articles in 67 journals." The date should be the date on which the page is being viewed.
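
As an illustration of the message format only, here is a minimal Python sketch. The real message is produced by the Java and XSLT code described under Workflow; the function name `overview_message` and its parameters are hypothetical, and the counts are passed in rather than looked up.

```python
from datetime import date


def overview_message(packages, files, journals, today=None):
    """Build the home page summary sentence (hypothetical helper)."""
    # Default to the date on which the page is being viewed.
    today = today or date.today()
    # %b is the abbreviated month name (English in Python's default C locale);
    # the day is printed without zero padding.
    stamp = f"{today.strftime('%b')} {today.day}, {today.year}"
    return (
        f"As of {stamp}, Dryad contains {packages} data packages and "
        f"{files} data files, associated with articles in {journals} journals."
    )


print(overview_message(596, 1445, 67, date(2011, 4, 25)))
```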

We also want to display a "viewed" and "downloaded" count for each package and file (on both the Dryad Data Package and Dryad Data File item view pages). Currently, only data files can be downloaded, but in the future we will allow downloads of complete packages in a compressed file format. For now, packages display the number of times they have been viewed, and files display the number of times they have been viewed and/or downloaded.

Workflow

DSpace stores its usage statistics in a Solr index called "statistics" (it has a Web-based administrative interface through which test queries can be run). Records stored in this index have an `owningColl` (owning collection) and an `id`. Some also have an `owningItem` (owning item). Records that have an owning item are records for bitstreams (indicating a download) that belong to a data file DSpace item. Dryad doesn't attach bitstreams directly to data packages, so every download is associated with a data file.
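
The rule described above (a record with an `owningItem` is a download, anything else is a view) can be sketched in Python. This is an illustration, not the Dryad Java code; the function name and the sample records are hypothetical, and the records are assumed to have already been parsed into dictionaries.

```python
def classify_record(record):
    """Return 'download' for bitstream records and 'view' otherwise."""
    # A record with an owningItem describes a bitstream that belongs to a
    # data file item, which indicates a download.
    if record.get("owningItem") is not None:
        return "download"
    # No owning item: the record is a view of an item page.
    return "view"


# Made-up sample records for illustration only.
records = [
    {"id": 101, "owningColl": 2},
    {"id": 555, "owningColl": 1, "owningItem": 202},
]
print([classify_record(r) for r in records])
```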

Relationships between data files and data packages are not stored in the statistics index (since this is a concept overlaid onto DSpace by Dryad). These relationships can, however, be gleaned from the item records in the "search" Solr index. The "search" index is where general searching takes place, and the relationships between records are stored there as linked identifiers. The "search" index also has a Web-based administrative interface through which test searches can be performed.

Since Dryad stores data packages and data files in separate collections, overall statistics (like the type displayed on the Dryad home page) can be gathered by checking the total number of active records in each collection. The additional information of how many unique journals have publications represented in Dryad is determined by querying the unique names of journals in the prism.publicationName field of the "search" Solr index.
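
One possible way to count unique journal names outside of DSpace is a Solr facet query on the `prism.publicationName` field. The sketch below is an assumption about how such a query could look, not a description of what the Dryad Java code does. The function names and the sample response are hypothetical, and the approach assumes the field is indexed so that each journal name is a single facet value.

```python
from urllib.parse import urlencode


def journal_facet_url(search_select_url):
    """Build a facet query URL; search_select_url is supplied by the caller."""
    params = {
        "q": "*:*",
        "rows": 0,            # no documents needed, only facet counts
        "facet": "true",
        "facet.field": "prism.publicationName",
        "facet.limit": -1,    # return every distinct value
        "facet.mincount": 1,  # skip values with no matching records
        "wt": "json",
    }
    return search_select_url + "?" + urlencode(params)


def count_unique_journals(response):
    """Count distinct journal names in a parsed Solr JSON response."""
    # By default Solr's JSON writer returns facet values as a flat list:
    # [name1, count1, name2, count2, ...]
    flat = response["facet_counts"]["facet_fields"]["prism.publicationName"]
    return len(flat) // 2


# Made-up response fragment for illustration only.
sample = {"facet_counts": {"facet_fields": {
    "prism.publicationName": ["Journal A", 10, "Journal B", 5]}}}
print(count_unique_journals(sample))
```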

All these statistical functions are performed in Java code which then writes a value into the `pageMeta` (page metadata). The Dryad XSLT theme then takes these values and displays them on the appropriate page when that page is rendered into HTML.

The Java classes that gather these Dryad-specific statistics are kept in the Dryad overlay of the xmlui module. The classes have been put into the org.datadryad.dspace.statistics package to indicate they are specific to Dryad and not just a slight modification of an existing DSpace function. Here are the classes used and a brief explanation of each:

org.datadryad.dspace.statistics.SiteOverview

  • Generates the overall statistical summary displayed on the home page

org.datadryad.dspace.statistics.ItemStatsOverview

  • Pulls together the stats generated by the ItemPkgStats and ItemFileStats classes and puts them into the page metadata
  • Caches the statistical generation so it doesn't need to be run at each page visit

org.datadryad.dspace.statistics.ItemPkgStats

  • Generates statistics for individual data packages

org.datadryad.dspace.statistics.ItemFileStats

  • Generates statistics for individual data files

For more details, consult the Java classes in the org.datadryad.dspace.statistics package directly.

The XSLT code that pulls the individual data package and data file statistics from the page metadata for display on the page can be found in the DryadItemSummary.xsl file. The XSLT code that displays the site's overall statistical information can be found in the default Dryad.xsl file.

Configuration

Only a minimal amount of configuration is needed in the dspace.cfg file for the Dryad statistics to work. The statistics rely on the locations of the Solr servers being set via the `solr.log.server` and `solr.search.server` variables. The Dryad statistics code also requires two additional variables to be set, `stats.datafiles.coll` and `stats.datapkgs.coll`, to indicate which collections contain the data files and data packages.

# The handle for the data package collection's stats (also used by DOI minter)
stats.datapkgs.coll = 10255/3
# The handle for the data file collection's stats (also used by DOI minter)
stats.datafiles.coll = 10255/2
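
As a quick sanity check outside of DSpace, the two collection settings can be read from the configuration text with a few lines of Python. This is a simplified sketch that only handles plain `key = value` lines and `#` comments, not the full Java properties syntax; the function name is hypothetical.

```python
def read_stats_collections(cfg_text):
    """Return the two Dryad statistics collection handles from cfg_text."""
    settings = {}
    for line in cfg_text.splitlines():
        line = line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        key, sep, value = line.partition("=")
        if sep:
            settings[key.strip()] = value.strip()
    required = ("stats.datapkgs.coll", "stats.datafiles.coll")
    missing = [key for key in required if key not in settings]
    if missing:
        raise KeyError("missing settings: " + ", ".join(missing))
    return {key: settings[key] for key in required}


example_cfg = """
# The handle for the data package collection's stats (also used by DOI minter)
stats.datapkgs.coll = 10255/3
# The handle for the data file collection's stats (also used by DOI minter)
stats.datafiles.coll = 10255/2
"""
print(read_stats_collections(example_cfg))
```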

Dryad sets the Solr server variables in the dspace.cfg file rather than the dspace-solr-search.cfg file so that we can use the standard Maven profiles mechanism to override these variables, depending on which Dryad instance we're running (demo, dev, production, etc.).

Statistics on the Statistics

All statistics are stored in Solr.

An easy way to process stats (outside of DSpace) is to construct a Solr query, curl it into a file, and then process the file to get the needed information.

File/Package views can be found by searching for field "id" with the item's internal DSpace item_id. File downloads can be found by searching for field "owningItem" with the item's internal DSpace item_id.
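
The two lookups described above can be sketched as URL builders in Python. The helper names are hypothetical, the base URL reuses the DRYAD_SERVER placeholder from the sample requests, and the explicit `AND` is a choice made here so the query does not depend on the index's default operator. With `rows=0` Solr returns no documents, but the response still reports the number of matches.

```python
from urllib.parse import urlencode

# Placeholder host, as in the sample requests; replace before use.
STATS_SELECT = "http://DRYAD_SERVER/solr/statistics/select/"


def stats_url(field, item_id, rows=0):
    """Build a statistics query for non-bot records matching field:item_id."""
    params = {
        "q": f"isBot:false AND {field}:{item_id}",
        "fl": "time",   # only return timestamps
        "rows": rows,
    }
    return STATS_SELECT + "?" + urlencode(params)


def views_url(item_id, rows=0):
    # Views: the "id" field holds the item's internal DSpace item_id.
    return stats_url("id", item_id, rows)


def downloads_url(item_id, rows=0):
    # Downloads: the "owningItem" field holds the item's internal item_id.
    return stats_url("owningItem", item_id, rows)


# 123 is a made-up item_id for illustration.
print(views_url(123))
print(downloads_url(123))
```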

Sample requests

WARNING: if you attempt to follow these links in a web browser, they will return very large results.
To prevent accidental usage, we have replaced datadryad.org with DRYAD_SERVER.

All downloads of data files (owning item = anything, request is not a bot, only return timestamps):

http://DRYAD_SERVER/solr/statistics/select/?indent=on&q=isBot:false+owningItem:%5B*%20TO%20*%5D&fl=time&rows=10000000

All views of data package pages (owning collection = 2, there is no owning item present, only return timestamps):

http://DRYAD_SERVER/solr/statistics/select/?indent=on&q=isBot:false+owningColl:2%20-owningItem:%5B""%20TO%20*%5D&fl=time&rows=100000000

To bin these by month, curl the URL into a file, grep for the month (e.g., "2011-10"), and pipe to wc.
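
The same binning can be done for all months at once with a short Python script instead of one grep per month. This is a minimal sketch that assumes the response has been saved to a file and that timestamps appear in the usual Solr form (for example `2011-10-03T12:34:56Z`); the function name and the sample text are hypothetical.

```python
import re
from collections import Counter

# Capture the year and month of each ISO-style timestamp in the response.
TIMESTAMP = re.compile(r"(\d{4}-\d{2})-\d{2}T\d{2}:\d{2}:\d{2}")


def bin_by_month(response_text):
    """Count timestamps per 'YYYY-MM' month in a saved Solr response."""
    return Counter(TIMESTAMP.findall(response_text))


# Made-up response fragment for illustration only.
sample_text = """
<doc><date name="time">2011-10-03T12:34:56Z</date></doc>
<doc><date name="time">2011-10-17T08:00:01Z</date></doc>
<doc><date name="time">2011-11-02T23:59:59Z</date></doc>
"""
for month, count in sorted(bin_by_month(sample_text).items()):
    print(month, count)
```

To use it on a real response, read the file saved by curl and pass its contents to `bin_by_month`.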

Relation to DSpace

Dryad's statistical display relies on the statistics log index created by the standard DSpace Statistics module. It also relies on the "search" index created by the Discovery module. If Solr index fields change in the Discovery module, the custom Dryad statistical functionality may break (for instance, if the publication names are put into a different index or if the relationships between files and packages are stored differently).

See Also

  • Reporting -- describes how reports are generated based on the statistics