Annual Statistics Reports

This page summarizes the statistics that are collected for annual reports. It describes the methods used to collect those statistics, so they can be collected in the same manner each year.

A note on dates
Although some of the stats use the word "submitted", we use the Date Accessioned to calculate all stats. This is the date on which an item entered the archive. For an item that is submitted to Dryad in late 2011 but not fully archived until early 2012, the statistics will count it for 2012. This allows us to calculate consistent statistics without dealing with the complexities that happen before an item is archived, which include the review process and publication blackout.

Annual cleanup of manuscript metadata
 * in Postgres, delete items older than 1 year (update the cutoff date each year):
   delete from manuscript where date_added < '2018-01-01';
 * in the Gmail account dryad.journal.submit@gmail.com, search for "older_than:365d", then hit the "select all" checkbox above the messages, click on "select all messages that match this search", and hit the trashcan icon to delete (you may have to do this more than once, since there's a limit to how much you can delete in a single action)
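The hardcoded date in the Postgres delete drifts out of date each year. A minimal sketch for computing the cutoff dynamically (GNU date syntax; the database name is an assumption, so the psql invocation is left commented out):

```shell
# Compute a cutoff of one year ago (GNU date) and build the delete statement.
CUTOFF=$(date -d "1 year ago" +%Y-%m-%d)
echo "delete from manuscript where date_added < '${CUTOFF}';"
# Hypothetical invocation -- the database name "dryad" is an assumption:
# psql -d dryad -c "delete from manuscript where date_added < '${CUTOFF}';"
```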

Cleanup of bots/spiders
Before running stats, make sure that the bots/spiders have been cleaned. We do our best to catch bots on the fly, but it's always good to do a final cleanup.

 * Check the homepage popular list to see whether any of the top items look like they have excessive bot activity. This often looks like a data package that has more downloads than views, or where there are several files but one was downloaded much more than any of the others.
 * For any data files that seem suspicious, use their item ID to search solr for accesses and identify IP addresses with unusually high access numbers:
   /opt/dryad-utils/spider-cleaner/statsMostDownloadsOfFile.sh file_item_ID
 * Add any problematic IP addresses to the spidertrap file on production:
   /opt/dryad/config/spiders/spidertrap-misc.txt
 * Ensure the other IP address lists are up to date, then update the statistics index:
   /opt/dryad/bin/dspace dsrun org.dspace.statistics.util.StatisticsClient -u
   /opt/dryad/bin/dspace dsrun org.dspace.statistics.util.StatisticsClient -m
   /opt/dryad/bin/dspace dsrun org.dspace.statistics.util.StatisticsClient -i
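Once the accesses for a suspicious item have been dumped with one IP per line (the extraction format and addresses below are invented for illustration), a quick way to surface outlier IPs is a sort | uniq -c pipeline:

```shell
# Sample IP list standing in for the solr extract; the addresses are invented.
cat > /tmp/item_ips.txt <<'EOF'
10.0.0.1
10.0.0.1
10.0.0.1
192.168.5.9
10.0.0.1
192.168.5.9
EOF
# Rank IPs by access count, most active first; outliers float to the top.
sort /tmp/item_ips.txt | uniq -c | sort -rn | head
```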

Basic submission stats
Run the dataPackageStats report and process the results.

 * Packages submitted in the year
    * Number of items that have a dateaccessioned in 2012
 * Total packages as of year end
    * Number of items that have a dateaccessioned in 2012 or earlier
 * Files submitted in the year
    * For the items that have a dateaccessioned in 2012, the total number of their data files
 * Total files as of year end
    * For the items that have a dateaccessioned in 2012 or earlier, the total number of their data files
 * Volume of data (GB) submitted in the year
    * For the items that have a dateaccessioned in 2012, the total size of their data files
 * Total volume of data at year end
    * For the items that have a dateaccessioned in 2012 or earlier, the total size of their data files
 * Proportion of submissions that are:
    * integrated
       * proportion for which there is a manuscript number
    * pre-review
       * in the dryadassistant account, set up a query to find email notifications of review, using "received for review" and appropriate dates.
       * This isn't very accurate because:
          * It counts the times when authors create multiple submissions for a single item.
          * The review number includes all items submitted for review purposes, but many of these may have their articles rejected.
          * The total submissions number includes only those submissions that end up in the archive.
          * With our current counting methods, we are not able to track a single submission through the process. That is, the review items are items submitted to the review stage in the year, while the archived items are items archived in the year. Many items enter the review stage in one year and are archived in a different year.
    * post-review (opposite of pre-review)
    * ones with author lists differing between article and data (difficult to get without ORCID)
 * Proportion of files submitted this year that are:
    * embargoed (can it be limited to embargo-option journals?)
       * from the dataPackageStats report, create a pivot table that uses the embargo settings as the rows (and values) and uses journalAllowsEmbargo as the columns
       * alternate, but not normally used: run the fileSimpleStats report for a count of embargo settings on data files
 * Readmes
    * in dataPackageStats, sort by date, then sum the number of readmes and divide by the number of files
 * new versions -- while the versioning system is being finalized, we just list how many are waiting
 * each file type
    * run the profileFormats report
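A sketch of how the package, file, and volume stats above could be tallied from a dataPackageStats-style export. The column names and order in this sample CSV are assumptions; adjust them to match the real report before use:

```shell
# Toy stand-in for the dataPackageStats export (columns are assumptions).
cat > /tmp/dps.csv <<'EOF'
dateAccessioned,numFiles,totalBytes
2012-03-14,2,1048576
2012-07-02,1,524288
2011-11-30,3,2097152
EOF
# Per accession year: package count, file count, and volume in GB.
awk -F, 'NR>1 { y=substr($1,1,4); pkgs[y]++; files[y]+=$2; bytes[y]+=$3 }
         END { for (y in pkgs) printf "%s packages=%d files=%d GB=%.3f\n", y, pkgs[y], files[y], bytes[y]/2^30 }' /tmp/dps.csv | sort
```

Cumulative ("as of year end") figures are then just running sums over the per-year lines.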

Authors
 * Number of authors associated with submissions this year
   select distinct text_value from metadatavalue where metadata_field_id=3 and item_id in (select item_id from collection2item where collection_id=2 and item_id in (select item_id from metadatavalue where metadata_field_id=11 and text_value > '2012-01-00' and text_value < '2013-01-00'));
 * Total number of authors represented in Dryad
    * see homepage, OR run:
   select distinct text_value from metadatavalue where metadata_field_id=3 and item_id in (select item_id from collection2item where collection_id=2);
 * Accounts created this year
    * can search email for subject "Registration Notification", but these messages aren't always saved.
 * Total number of accounts in Dryad
   select count(*) from eperson;
 * Distribution of packages by author over all years (list of top authors)
    * perform an empty search, and look at the author facet
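The distinct-author queries print one name per row, so the headline counts can be taken from the psql output with a sort -u pipeline (the sample names below are invented):

```shell
# Stand-in for psql output of the distinct-author query; names are invented.
cat > /tmp/authors.txt <<'EOF'
Smith, Jane
Garcia, Luis
Smith, Jane
EOF
# Count distinct authors.
sort -u /tmp/authors.txt | wc -l
```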

Website Usage
 * Data file downloads (all file downloads in a given year)
    * get the full stats:
      curl "http://DRYAD_SERVER/solr/statistics/select/?q=-isBot:true+owningItem:%5B*%20TO%20*%5D&fl=time&rows=10000000" > downloads.txt
    * pull out the relevant download timestamps and count them:
      grep -o "2012-" downloads.txt | wc
 * Data package views (not working as of 2015-01-07)
    * get the full stats:
      curl "http://DRYAD_SERVER/solr/statistics/select/?q=-isBot:true+owningColl:2%20-owningItem:%5B*%20TO%20*%5D&fl=time&rows=100000000" > packageViews.txt
    * pull out the relevant view timestamps and count them:
      grep -o "2012-" packageViews.txt | wc
 * Top downloaded packages
    * sort the dataPackageStats report by the "downloads" column
    * Note that this counts downloads for all time, not just during the time period. We need a better way to filter the downloads to the current year. Our 2013 report listed this as "Top five most downloaded data packages published in 2013"
 * web sessions, per month over time
    * In Google Analytics, set the reporting time to the correct timespan (Jan 1 - Dec 31)
    * read the values for sessions
 * web sessions, broken down by country, language
    * In Google Analytics, set the reporting time to the correct timespan (Jan 1 - Dec 31)
    * Look in the "Location" section
 * sources of visitors
    * create a report in Google Analytics
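The yearly grep over downloads.txt extends naturally to a per-month breakdown. The timestamp markup in this sample is a guess at the solr response format, so check it against a real downloads.txt before relying on it:

```shell
# Toy stand-in for the solr response dump; the markup format is an assumption.
cat > /tmp/downloads.txt <<'EOF'
<date name="time">2012-01-15T10:00:00Z</date>
<date name="time">2012-01-20T11:30:00Z</date>
<date name="time">2012-03-02T09:05:00Z</date>
EOF
# Count downloads per month instead of per year.
grep -o '2012-[0-9][0-9]' /tmp/downloads.txt | sort | uniq -c
```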

Overall stats
Ensure files in the dryad-data repository are up to date:
 * https://github.com/datadryad/dryad-data/blob/master/dataStorage.csv

Social media

 * Blog visits
 * Twitter followers

Journals

 * Currently integrated, in process, and on hold
    * see Submission Integration: Current Status and the Trello journal integration board
 * Breakdown of submissions by journal
    * run the DataPackagesPerJournal report (but first update the date window and recompile)
    * OR from the dataPackageStats report, create a pivot table -- use the journal names for both the rows of the table and the values area
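The journal pivot table can also be approximated on the command line. The column layout of this sample CSV is an assumption; match it to the real dataPackageStats export:

```shell
# Toy stand-in for the dataPackageStats export (columns are assumptions).
cat > /tmp/journals.csv <<'EOF'
journal,dateAccessioned
Evolution,2012-02-01
Molecular Ecology,2012-04-11
Evolution,2012-09-30
EOF
# Count packages per journal, most-published first.
awk -F, 'NR>1 { n[$1]++ } END { for (j in n) print n[j], j }' /tmp/journals.csv | sort -rn
```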