Curator Report and Query Technology

From Dryad wiki
Jump to: navigation, search

Methods for generating various reports that are needed for Dryad presentations and documents.

DSpace Curation Reports

Many of the reports needed by curators can be run through the DSpace curation tasks system.


Summary Reports

Distribution of bitstream formats across all items in the Data Files collection:

/opt/dryad/bin/dspace curate -v -t profileformats -i 10255/2 -r - >~/temp/profileFormats.txt


Check that data files/packages have all of the required metadata:

/opt/dryad/bin/dspace curate -v -t requiredmetadata -i 10255/2 -r - >~/temp/requiredFileMetadata.txt
/opt/dryad/bin/dspace curate -v -t requiredmetadata -i 10255/3 -r - >~/temp/requiredFileMetadata.txt

Identifiers

  • datacitechecker
  • datacitesynchronizer

Data Files

Aggregate statistics about data files, including embargo status:

/opt/dryad/bin/dspace curate -v -t filestats -i 10255/2 -r - >~/temp/fileStats.txt

Locate embargoed files that do not yet have a designated end date for the embargo:

  • embargoedwithoutpubdate

Complete statistics about individual data files. This includes: handle, fileDOI, journal, journalAllowsEmbargo, journalAllowsReview, fileSize, embargoType, embargoDate, downloads, views, dateAccessioned.

/opt/dryad/bin/dspace curate -v -t datafilestats -i 10255/2 -r - >~/temp/dataFileStats.txt

Detailed download statistics for a particular file, identified by its itemID (in this example 40063):

http://datadryad.org/solr/statistics/select/?indent=on&rows=10000000&fl=continent,countryCode,city,time,userAgent&q=isBot:false+owningItem:40063

Data Packages

Aggregate statistics about data packages, including counts of integrated submissions:

/opt/dryad/bin/dspace curate -v -t packagestats -i 10255/3 -r - >~/temp/packageStats.txt

Simple/quick statistics about individual data packages. This includes: handle, doi, journal, dateAccessioned.

/opt/dryad/bin/dspace curate -v -t datapackageinfo -i 10255/3 -r - >~/temp/packageInfo.txt

Complete statistics about individual data packages. This includes: handle, doi, journal, numKeywords, numKeywordsJournal, numberOfFiles, packageSize, embargoType, numberOfDownloads, manuscriptNum, numReadmes, wentThroughReview, dateAccessioned.

/opt/dryad/bin/dspace curate -v -t datapackagestats -i 10255/3 -r - >~/tmp/dataPackageStats.txt

The report can be processed using grep to find the lines for the CSV file by typing the following from within the /tmp directory:

grep '^10255/dryad.' dataPackageStats.txt > dataPackageStats.csv

Journals

Dates for the first archived item associated with each journal in the repository:

/opt/dryad/bin/dspace curate -v -t journaldates -i 10255/3 -r - >~/temp/journalDates.txt

Number of data packages for each journal in the repository:

/opt/dryad/bin/dspace curate -v -t datapackagesperjournal -i 10255/3 -r - >~/temp/journalPackages.txt

List of all journals associated with archived items:

\copy (select distinct text_value from metadatavalue where metadata_field_id =77 and item_id IN (select item_id from item where in_archive=TRUE and item_id IN (select item_id from collection2item where collection_id=2))) to '/home/dryad/currentJournals.txt' with CSV;

List of all journals associated with items in publication blackout:

\copy (select text_value from metadatavalue where metadata_field_id =77 and item_id in (select item_id from metadatavalue where text_value like '%Entered publication blackout%' and item_id not in (select item_id  from metadatavalue where text_value like '%Approved for entry into archive%'))) to '/home/dryad/allJournalsBlackout.txt' with csv;

Items in curation

Items in the review workflow stage:

/opt/dryad/bin/dspace curate -v -t itemsinreview -i 10255/3 -r - >~/temp/itemsInReview.csv

Graphing Dryad submissions over time

First, get all of the deposit dates. On the production server:

postgres-client.sh -c "select text_value from metadatavalue where metadata_field_id=12 and item_id in (select item_id from item where owning_collection=2 and in_archive='t');" > dryadSubmitDates.txt

(collection 2 is "Data Packages" and metadata field 12 is "Date Available")

Edit the file to remove the timestamps (leave only the date portion).

Sort the dates.

Import the dryadSubmitDates.txt into Excel.

Create a column beside the dates that starts at 1 and counts up (this represents the total number of submissions present on each day).

Create a graph of these two columns.

Items in publication blackout

Items in blackout are owned by "Dryad Queue". The queue's ePerson ID is 949.

select item_id from workflowitem where workflow_id IN (select workflow_item_id from taskowner where owner_id=949);

To get metadata about items in the blackout, you must select the metadata directly from the database, one field at a time. Below is a query to select metadata field 17 (DOI). When performing these types of queries, always order by item_id to keep the ordering of the metadata values consistent across queries. select text_value from metadatavalue where metadata_field_id=17 and item_id IN (select item_id from workflowitem where workflow_id IN (select workflow_item_id from taskowner where owner_id=949)) ORDER BY item_id;

Submissions from a particular instituiton

This query will identify archived data packages that are associated with a particular email suffix. Note that this undercounts the number of deposits from an institution, since around 25% of Dryad user create accounts with a generic email address (e.g., GMail).

select item_id from item where submitter_id IN (select eperson_id from eperson where email like '%TARGET_INSTITUTION_EMAIL_SUFFIX') and in_archive='t' and item_id IN (select item_id from collection2item where collection_id=2) order by last_modified;

Vouchers

\copy (select voucher.code as voucher_code, shoppingcart.status, voucher.customer, metadatavalue.text_value as DOI, metadatavalue.item_id, shoppingcart.notes
from voucher, shoppingcart, metadatavalue
where
shoppingcart.voucher >0
and shoppingcart.voucher = voucher.voucher_id
and shoppingcart.item = metadatavalue.item_id
and metadatavalue.metadata_field_id = 17) to '/tmp/voucher_export.csv' with csv;

Item Metadata

select element, qualifier, text_value
from metadatavalue, metadatafieldregistry
where metadatavalue.metadata_field_id = metadatafieldregistry.metadata_field_id
and item_id =145629;

Items That Have Been Paid, but are Incorrectly in Reauthorization

select item_id from workflowitem, taskowner, shoppingcart where workflowitem.workflow_id = taskowner.workflow_item_id and (taskowner.step_id like 'reAuthorizationPaymentStep' or taskowner.step_id like 'pendingPublicationReAuthorizationPaymentStep') and workflowitem.item_id=shoppingcart.item and shoppingcart.status like 'completed';

See Also

  • Annual Statistics Reports -- describes the statistics we collect for annual reports, and how they are collected
  • Statistics Technology -- describes how the statistics tracking system works, including how to access file download statistics
  • Batch Metadata Editing -- describes a process for exporting metadata, which can be used for reporting